{ "cells": [ { "cell_type": "markdown", "id": "c49b5055-cb25-4563-966d-430aebe0f434", "metadata": {}, "source": [ "# Salary data" ] }, { "cell_type": "code", "execution_count": 7, "id": "efeae42d-4556-483c-99ae-06e7f81216a2", "metadata": {}, "outputs": [], "source": [ "# Module\n", "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "\n", "# sklearn\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.linear_model import LinearRegression" ] }, { "cell_type": "code", "execution_count": 8, "id": "7a0abee4-5bb6-4188-9613-94b9755bd8c7", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table border=\"1\" class=\"dataframe\">\n",
 " <thead>\n",
 " <tr><th></th><th>YearsExperience</th><th>Salary</th></tr>\n",
 " </thead>\n",
 " <tbody>\n",
 " <tr><th>0</th><td>1.1</td><td>18734.761905</td></tr>\n",
 " <tr><th>1</th><td>1.3</td><td>22002.380952</td></tr>\n",
 " <tr><th>2</th><td>1.5</td><td>17967.142857</td></tr>\n",
 " <tr><th>3</th><td>2.0</td><td>20726.190476</td></tr>\n",
 " <tr><th>4</th><td>2.2</td><td>18995.714286</td></tr>\n",
 " </tbody>\n",
 "</table>
" ], "text/plain": [ " YearsExperience Salary\n", "0 1.1 18734.761905\n", "1 1.3 22002.380952\n", "2 1.5 17967.142857\n", "3 2.0 20726.190476\n", "4 2.2 18995.714286" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Import data\n", "dataset = pd.read_excel('SalaryData.xlsx')\n", "dataset.head()" ] }, { "cell_type": "markdown", "id": "25e81c8a-4aea-4927-8ec5-969b7bf801a3", "metadata": {}, "source": [ "# Data Preprocessing\n", "With the dataset imported, we separate the feature column (years of experience) from the target column (salary)." ] }, { "cell_type": "code", "execution_count": 9, "id": "9e8485cf-7733-4a0f-829f-d28dd4194f28", "metadata": {}, "outputs": [], "source": [ "X = dataset.iloc[:, :-1].values # Independent variable array (2-D: every column except the last)\n", "y = dataset.iloc[:, 1].values # Dependent variable vector (1-D: the Salary column)" ] }, { "cell_type": "code", "execution_count": 5, "id": "9f7031a9-d52a-4624-b09d-0d2ad0ac8f3a", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([18734.76190476, 22002.38095238, 17967.14285714, 20726.19047619,\n", " 18995.71428571, 26972.38095238, 28642.85714286, 25926.19047619,\n", " 30688.0952381 , 27232.85714286, 30103.80952381, 26568.57142857,\n", " 27122.38095238, 27181.42857143, 29100.47619048, 32351.42857143,\n", " 31442.38095238, 39565.71428571, 38744.28571429, 44733.33333333,\n", " 43684.76190476, 46796.66666667, 48239.04761905, 54196.19047619,\n", " 52110. , 50277.14285714, 55699.52380952, 53635.71428571,\n", " 58281.42857143, 58034.28571429])" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y" ] }, { "cell_type": "markdown", "id": "195b485f-1595-4602-a0f3-c1a7f6698dfd", "metadata": {}, "source": [ "`X` is the independent variable array and `y` is the dependent variable vector. Note the difference between the two: scikit-learn estimators expect `X` as a 2-D array of shape `(n_samples, n_features)`, even when there is only one feature, while `y` is a 1-D vector of shape `(n_samples,)`. Slicing with `[:, :-1]` keeps `X` two-dimensional, whereas selecting the single column `[:, 1]` yields a one-dimensional `y`." 
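, "\n", "\n",
"As a minimal sketch of that difference, the snippet below rebuilds the first three rows shown by `dataset.head()` in a small stand-in frame and prints the shapes the two slices produce. The names `demo`, `X_demo` and `y_demo` are illustrative assumptions, not part of this notebook.\n", "\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical stand-in for the first rows of SalaryData.xlsx\n",
"demo = pd.DataFrame({\n",
"    'YearsExperience': [1.1, 1.3, 1.5],\n",
"    'Salary': [18734.761905, 22002.380952, 17967.142857],\n",
"})\n",
"\n",
"X_demo = demo.iloc[:, :-1].values  # slice of columns -> stays 2-D\n",
"y_demo = demo.iloc[:, 1].values    # single column -> 1-D\n",
"\n",
"print(X_demo.shape)  # (3, 1)\n",
"print(y_demo.shape)  # (3,)\n",
"```\n", "\n",
"The same check on the real data is `X.shape` and `y.shape`; if the file holds only the two columns shown above, these should be `(30, 1)` and `(30,)`."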
] }, { "cell_type": "markdown", "id": "8c87bbe5-42de-4181-b5d1-6dd66d855f41", "metadata": {}, "source": [ "# Splitting the dataset\n", "We need to split our dataset into a training set and a test set. Common choices hold out 20% or 30% of the data for testing and train on the remaining 80% or 70%; here `test_size=1/3` holds out 10 of the 30 observations.\n", "\n", "Why is it necessary to perform splitting? We train the model on the years of experience and salaries in the training set, then test it on observations it has not seen.\n", "\n", "We check whether the predictions made by the model on the test set are close to the salaries given in the dataset.\n", "\n", "If they are close, the model generalizes well to new data. Exact matches are not expected, because real salaries scatter around the fitted line." ] }, { "cell_type": "code", "execution_count": 10, "id": "f5ecaa22-ae4d-424e-a294-dc34578ea3ee", "metadata": {}, "outputs": [], "source": [ "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)" ] }, { "cell_type": "markdown", "id": "bdc15e9d-7ce9-4e11-a83d-10172cf16c32", "metadata": {}, "source": [ "# Fitting linear regression model into the training set\n", "We use the `LinearRegression` class imported earlier from `sklearn.linear_model` and create an instance of it called `regressor`.\n", "\n", "To fit the regressor to the training set, we call its `fit` method.\n", "\n", "`fit` takes `X_train` (the training matrix of features) and the target values `y_train`. The model learns the linear relationship between them, that is, how to predict the dependent variable from the independent variable." 
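, "\n", "\n",
"To make concrete what `fit` learns, here is a small sketch on synthetic, noise-free data, where the slope and intercept are known in advance. The data and the names `X_toy`, `y_toy` and `toy_regressor` are assumptions made for illustration, not values from the salary dataset.\n", "\n",
"```python\n",
"import numpy as np\n",
"from sklearn.linear_model import LinearRegression\n",
"\n",
"# Synthetic data (assumed for illustration): salary = 9000 * years + 25000\n",
"X_toy = np.array([[1.0], [2.0], [3.0], [4.0]])  # 2-D matrix of features\n",
"y_toy = 9000.0 * X_toy.ravel() + 25000.0        # 1-D target vector\n",
"\n",
"toy_regressor = LinearRegression()\n",
"toy_regressor.fit(X_toy, y_toy)\n",
"\n",
"slope = toy_regressor.coef_[0]        # learned change in salary per year\n",
"intercept = toy_regressor.intercept_  # learned salary at zero years\n",
"print(slope, intercept)\n",
"```\n", "\n",
"On this toy data the fitted slope and intercept should recover 9000 and 25000 up to floating-point error. On the real training set, `regressor.coef_` and `regressor.intercept_` expose the same two quantities after `fit` has run."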
] }, { "cell_type": "code", "execution_count": 12, "id": "748273c6-8ef6-4f68-9c64-c3dd154acce5", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "LinearRegression()" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "regressor = LinearRegression()\n", "regressor.fit(X_train, y_train) # Fits the line (slope and intercept) to the training data" ] }, { "cell_type": "markdown", "id": "c62e817f-4400-408b-83bc-3f987f823f5d", "metadata": {}, "source": [ "# Predicting the test set results\n", "We create a vector called `y_pred` that holds the predicted salary for every observation in the test set.\n", "\n", "The `predict` method makes the predictions. Its argument must be a 2-D array or sparse matrix of features, so we pass `X_test`." ] }, { "cell_type": "code", "execution_count": 13, "id": "337819b1-a216-46db-8ff1-d600e5e9abe6", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([19445.28852796, 58609.23781342, 31016.4553623 , 30126.36560581,\n", " 55048.87878747, 51488.51976152, 55493.92366572, 30571.41048406,\n", " 36356.99390123, 47928.16073557])" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_pred = regressor.predict(X_test)\n", "y_pred" ] }, { "cell_type": "code", "execution_count": 7, "id": "9d2862a3-dc06-4863-9fcc-2608d465038a", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([17967.14285714, 58281.42857143, 27181.42857143, 30103.80952381,\n", " 55699.52380952, 52110. 
, 53635.71428571, 26568.57142857,\n", " 39565.71428571, 48239.04761905])" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_test" ] }, { "cell_type": "markdown", "id": "a1b68452-54ed-49d9-8683-cdca10fa0835", "metadata": {}, "source": [ "- `y_test` holds the real salaries of the test set.\n", "- `y_pred` holds the predicted salaries for the same observations." ] }, { "cell_type": "markdown", "id": "0823a81e-d6ff-40a0-ac10-7244c908d165", "metadata": {}, "source": [ "# Visualizing the results\n", "Let’s visualize the fitted model against the data.\n", "\n", "## Plotting the points (observations)\n", "We use matplotlib to plot the real observations, i.e. the salaries given in the dataset.\n", "\n", "The X-axis shows years of experience and the Y-axis shows salary: the scatter points are the real salaries and the line is the model's prediction.\n", "\n", "`plt.scatter` draws a scatter plot of the data. Its parameters here are:\n", "\n", "- X coordinates (`X_train`): number of years\n", "- Y coordinates (`y_train`): real salaries of the employees\n", "- Color: observation points in red (the regression line is drawn in blue)\n", "\n", "## Plotting the regression line\n", "`plt.plot` takes the following parameters:\n", "\n", "- X coordinates (`X_train`): number of years\n", "- Y coordinates (`regressor.predict(X_train)`): salaries predicted for `X_train`, based on the number of years\n", "\n", "Note: *The y-coordinate is not `y_pred`, because `y_pred` holds the predicted salaries for the test set observations, which do not line up with `X_train`.*" ] }, { "cell_type": "code", "execution_count": 14, "id": "0a0e3eb3-402c-42f0-a4f0-0d2e3a226b76", "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Plot for the Training set\n", "plt.scatter(X_train, y_train, color='red') # Observed training points\n", "plt.plot(X_train, regressor.predict(X_train), color='blue') # Fitted regression line\n", "plt.title(\"Salary vs Experience (Training set)\")\n", "plt.xlabel(\"Years of experience\")\n", "plt.ylabel(\"Salary\")\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 15, "id": "c03f74cd-293b-4b3b-89a4-533f80c719ac", "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Plot for the Testing set\n", "plt.scatter(X_test, y_test, color='red') # Observed test points\n", "plt.plot(X_train, regressor.predict(X_train), color='blue') # Same regression line, fitted on the training set\n", "plt.title(\"Salary vs Experience (Testing set)\")\n", "plt.xlabel(\"Years of experience\")\n", "plt.ylabel(\"Salary\")\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "id": "ce4628b2-75f2-418e-ae25-9ece6a5b7674", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" } }, "nbformat": 4, "nbformat_minor": 5 }