Regression Week 2: Multiple Regression (Interpretation)

The goal of this first notebook is to explore multiple regression and feature engineering with existing graphlab functions.

In this notebook you will use data on house sales in King County to predict prices using multiple regression. You will:

  • Use SFrames to do some feature engineering
  • Use built-in graphlab functions to compute the regression weights (coefficients/parameters)
  • Given the regression weights, predictors, and outcome, write a function to compute the Residual Sum of Squares (RSS)
  • Look at coefficients and interpret their meanings
  • Evaluate multiple models via RSS
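
To make the RSS bullets concrete, here is a minimal sketch in plain Python (no graphlab). The function names, the toy data, and the two weight vectors are all made up for illustration and are not taken from the King County dataset or from the notebook's own solution. The weights are assumed to be ordered as [intercept, one weight per feature].

```python
def predict(weights, features):
    """Return intercept + sum(w_j * x_j) for one row of feature values."""
    intercept, slopes = weights[0], weights[1:]
    return intercept + sum(w * x for w, x in zip(slopes, features))


def residual_sum_of_squares(weights, feature_rows, outcomes):
    """Sum of squared differences between observed and predicted outcomes."""
    return sum((y - predict(weights, row)) ** 2
               for row, y in zip(feature_rows, outcomes))


# Made-up toy data: each row is [sqft_living, bedrooms]
toy_features = [[1000.0, 2.0], [2000.0, 3.0], [3000.0, 4.0]]
toy_prices = [200000.0, 400000.0, 650000.0]

# Two hypothetical models: [intercept, w_sqft_living, w_bedrooms]
weights_a = [0.0, 200.0, 0.0]
weights_b = [0.0, 210.0, 0.0]

rss_a = residual_sum_of_squares(weights_a, toy_features, toy_prices)
rss_b = residual_sum_of_squares(weights_b, toy_features, toy_prices)
```

On this toy data the second weight vector gives the smaller RSS, so it fits these three points better. Comparing models by RSS works the same way on real data, as long as every model is evaluated on the same rows.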

Fire up GraphLab Create

In [42]:
import graphlab

Load in house sales data

The dataset covers house sales in King County, the region where the city of Seattle, WA, is located.

In [43]:
sales = graphlab.SFrame('kc_house_data.gl/')
In [44]:
# Explore the data
graphlab.canvas.set_target('ipynb')
# num_cols is a method, so it has to be called to get the count
print 'Number of columns: ' + str(sales.num_cols())
print 'Number of data items: ' + str(len(sales))
# Print the column types and the first ten rows as text
print repr(sales)
sales.show(view="Table")
Number of columns: 21
Number of data items: 21613
Columns:
	id	str
	date	datetime
	price	float
	bedrooms	float
	bathrooms	float
	sqft_living	float
	sqft_lot	int
	floors	str
	waterfront	int
	view	int
	condition	int
	grade	int
	sqft_above	int
	sqft_basement	int
	yr_built	int
	yr_renovated	int
	zipcode	str
	lat	float
	long	float
	sqft_living15	float
	sqft_lot15	float

Rows: 21613

Data:
+------------+---------------------------+-----------+----------+-----------+
|     id     |            date           |   price   | bedrooms | bathrooms |
+------------+---------------------------+-----------+----------+-----------+
| 7129300520 | 2014-10-13 00:00:00+00:00 |  221900.0 |   3.0    |    1.0    |
| 6414100192 | 2014-12-09 00:00:00+00:00 |  538000.0 |   3.0    |    2.25   |
| 5631500400 | 2015-02-25 00:00:00+00:00 |  180000.0 |   2.0    |    1.0    |
| 2487200875 | 2014-12-09 00:00:00+00:00 |  604000.0 |   4.0    |    3.0    |
| 1954400510 | 2015-02-18 00:00:00+00:00 |  510000.0 |   3.0    |    2.0    |
| 7237550310 | 2014-05-12 00:00:00+00:00 | 1225000.0 |   4.0    |    4.5    |
| 1321400060 | 2014-06-27 00:00:00+00:00 |  257500.0 |   3.0    |    2.25   |
| 2008000270 | 2015-01-15 00:00:00+00:00 |  291850.0 |   3.0    |    1.5    |
| 2414600126 | 2015-04-15 00:00:00+00:00 |  229500.0 |   3.0    |    1.0    |
| 3793500160 | 2015-03-12 00:00:00+00:00 |  323000.0 |   3.0    |    2.5    |
+------------+---------------------------+-----------+----------+-----------+
+-------------+----------+--------+------------+------+-----------+-------+------------+
| sqft_living | sqft_lot | floors | waterfront | view | condition | grade | sqft_above |
+-------------+----------+--------+------------+------+-----------+-------+------------+
|    1180.0   |   5650   |   1    |     0      |  0   |     3     |   7   |    1180    |
|    2570.0   |   7242   |   2    |     0      |  0   |     3     |   7   |    2170    |
|    770.0    |  10000   |   1    |     0      |  0   |     3     |   6   |    770     |
|    1960.0   |   5000   |   1    |     0      |  0   |     5     |   7   |    1050    |
|    1680.0   |   8080   |   1    |     0      |  0   |     3     |   8   |    1680    |
|    5420.0   |  101930  |   1    |     0      |  0   |     3     |   11  |    3890    |
|    1715.0   |   6819   |   2    |     0      |  0   |     3     |   7   |    1715    |
|    1060.0   |   9711   |   1    |     0      |  0   |     3     |   7   |    1060    |
|    1780.0   |   7470   |   1    |     0      |  0   |     3     |   7   |    1050    |
|    1890.0   |   6560   |   2    |     0      |  0   |     3     |   7   |    1890    |
+-------------+----------+--------+------------+------+-----------+-------+------------+
+---------------+----------+--------------+---------+-------------+
| sqft_basement | yr_built | yr_renovated | zipcode |     lat     |
+---------------+----------+--------------+---------+-------------+
|       0       |   1955   |      0       |  98178  | 47.51123398 |
|      400      |   1951   |     1991     |  98125  | 47.72102274 |
|       0       |   1933   |      0       |  98028  | 47.73792661 |
|      910      |   1965   |      0       |  98136  |   47.52082  |
|       0       |   1987   |      0       |  98074  | 47.61681228 |
|      1530     |   2001   |      0       |  98053  | 47.65611835 |
|       0       |   1995   |      0       |  98003  | 47.30972002 |
|       0       |   1963   |      0       |  98198  | 47.40949984 |
|      730      |   1960   |      0       |  98146  | 47.51229381 |
|       0       |   2003   |      0       |  98038  | 47.36840673 |
+---------------+----------+--------------+---------+-------------+
+---------------+---------------+-----+
|      long     | sqft_living15 | ... |
+---------------+---------------+-----+
| -122.25677536 |     1340.0    | ... |
|  -122.3188624 |     1690.0    | ... |
| -122.23319601 |     2720.0    | ... |
| -122.39318505 |     1360.0    | ... |
| -122.04490059 |     1800.0    | ... |
| -122.00528655 |     4760.0    | ... |
| -122.32704857 |     2238.0    | ... |
| -122.31457273 |     1650.0    | ... |
| -122.33659507 |     1780.0    | ... |
|  -122.0308176 |     2390.0    | ... |
+---------------+---------------+-----+
[21613 rows x 21 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.