{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"u'Connected: None@PS1.db'"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%load_ext sql\n",
"%sql sqlite:///PS1.db"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Problem Set 1\n",
"=======\n",
"\n",
"\n",
"### Instructions / Notes:\n",
"\n",
"**_Read these carefully_**\n",
"\n",
"* Run the top cell above to load the database `PS1.db` (make sure the actual database file, `PS1.db`, is in the same directory as this IPython notebook is running in)\n",
"* Some of the problems involve _changing_ this database (e.g. deleting rows)- you can always re-download `PS1.db` or make a copy if you want to start fresh!\n",
"* You **may** create new IPython notebook cells to use for e.g. testing, debugging, exploring, etc.- this is encouraged in fact!- **just make sure that your final answer for each question is _in its own cell_ and _clearly indicated_**\n",
"* When you see `In [*]:` to the left of the cell you are executing, this means that the code / query is _running_.\n",
" * **If the cell is hanging- i.e. running for too long: To restart the SQL connection, you must restart the entire python kernel**\n",
" * To restart kernel using the menu bar: \"Kernel >> Restart >> Clear all outputs & restart\"), then re-execute the sql connection cell at top\n",
" * You will also need to restart the connection if you want to load a different version of the database file\n",
"* Remember:\n",
" * `%sql [SQL]` is for _single line_ SQL queries\n",
" * `%%sql [SQL]` is for _multi line_ SQL queries\n",
"* **See Piazza for submission instructions**\n",
"* _Have fun!_"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Problem 1: Linear Algebra\n",
"------------------------\n",
"\n",
"Two random 3x3 ($N=3$) matrices have been provided in tables `A` and `B`, having the following schema:\n",
"> * `i INT`: Row index\n",
"> * `j INT`: Column index\n",
"> * `val INT`: Cell value\n",
"\n",
"**Note: all of your answers below _must_ work for any _square_ matrix sizes, i.e. any value of $N$**.\n",
"\n",
"Note how the matrices are represented- why do we choose this format? Run the following queries to see the matrices in a nice format:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Done.\n"
]
},
{
"data": {
"text/html": [
"
\n",
" \n",
" A | \n",
"
\n",
" \n",
" 7 , 5 , 8 | \n",
"
\n",
" \n",
" 10 , 7 , 7 | \n",
"
\n",
" \n",
" 2 , 0 , 5 | \n",
"
\n",
"
"
],
"text/plain": [
"[(u'7 , 5 , 8',), (u'10 , 7 , 7',), (u'2 , 0 , 5',)]"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%sql SELECT group_concat(val, \" , \") AS \"A\" FROM A GROUP BY i;"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Done.\n"
]
},
{
"data": {
"text/html": [
"\n",
" \n",
" B | \n",
"
\n",
" \n",
" 9 , 6 , 10 | \n",
"
\n",
" \n",
" 7 , 6 , 9 | \n",
"
\n",
" \n",
" 1 , 1 , 7 | \n",
"
\n",
"
"
],
"text/plain": [
"[(u'9 , 6 , 10',), (u'7 , 6 , 9',), (u'1 , 1 , 7',)]"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%sql SELECT group_concat(val, \" , \") AS \"B\" FROM B GROUP BY i;"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Part (a): Matrix transpose\n",
"\n",
"_Transposing_ a matrix $A$ is the operation of switching its rows with its columns- written $A^T$. For example, if we have:\n",
"\n",
"$A=\\begin{bmatrix}a & b & c\\\\ d & e & f\\\\ g & h & i\\end{bmatrix}$\n",
"\n",
"Then:\n",
"\n",
"$A^T=\\begin{bmatrix}a & d & g\\\\ b & e & h\\\\ c & f & i\\end{bmatrix}$\n",
"\n",
"Write a _single SQL query_ to get the matrix transpose $A^T$ (in the same format as $A$- output tuples should be of format `(i,j,val)` where `i` is row, `j` is column, and the output is ordered by row then column index)\n",
"\n",
"Write your query here:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Done.\n"
]
},
{
"data": {
"text/html": [
"\n",
" \n",
" i | \n",
" j | \n",
" val | \n",
"
\n",
" \n",
" 0 | \n",
" 0 | \n",
" 7 | \n",
"
\n",
" \n",
" 0 | \n",
" 1 | \n",
" 10 | \n",
"
\n",
" \n",
" 0 | \n",
" 2 | \n",
" 2 | \n",
"
\n",
" \n",
" 1 | \n",
" 0 | \n",
" 5 | \n",
"
\n",
" \n",
" 1 | \n",
" 1 | \n",
" 7 | \n",
"
\n",
" \n",
" 1 | \n",
" 2 | \n",
" 0 | \n",
"
\n",
" \n",
" 2 | \n",
" 0 | \n",
" 8 | \n",
"
\n",
" \n",
" 2 | \n",
" 1 | \n",
" 7 | \n",
"
\n",
" \n",
" 2 | \n",
" 2 | \n",
" 5 | \n",
"
\n",
"
"
],
"text/plain": [
"[(0, 0, 7),\n",
" (0, 1, 10),\n",
" (0, 2, 2),\n",
" (1, 0, 5),\n",
" (1, 1, 7),\n",
" (1, 2, 0),\n",
" (2, 0, 8),\n",
" (2, 1, 7),\n",
" (2, 2, 5)]"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%sql SELECT A.j AS \"i\", A.i AS \"j\", A.val FROM A ORDER BY i,j;"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Part (b): Dot product I\n",
"\n",
"The _dot product_ of two vectors\n",
"\n",
"$a = \\begin{bmatrix}a_1 & a_2 & \\dots & a_n\\end{bmatrix}$\n",
"\n",
"and\n",
"\n",
"$b = \\begin{bmatrix}b_1 & b_2 & \\dots & b_n\\end{bmatrix}$\n",
"\n",
"is\n",
"\n",
"$a\\cdot b = \\sum_{i=1}^n a_ib_i = a_1b_1 + a_2b_2 + \\dots + a_nb_n$\n",
"\n",
"Write a _single SQL query_ to take the dot product of the **second column of $A$** and the **third column of $B$.**\n",
"\n",
"Write your query here:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Done.\n"
]
},
{
"data": {
"text/html": [
"\n",
" \n",
" SUM(A.val * B.val) | \n",
"
\n",
" \n",
" 113 | \n",
"
\n",
"
"
],
"text/plain": [
"[(113,)]"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%sql\n",
"SELECT SUM(A.val * B.val) \n",
"FROM A, B\n",
"WHERE A.i = B.i AND A.j = 1 AND B.j = 2;"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Part (c): Dot product II\n",
"\n",
"Write a _single SQL query_ to take the dot product of the **second _row_ of $A$** and the **third column of $B$.**\n",
"\n",
"Write your query here:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Done.\n"
]
},
{
"data": {
"text/html": [
"\n",
" \n",
" SUM(A.val * B.val) | \n",
"
\n",
" \n",
" 212 | \n",
"
\n",
"
"
],
"text/plain": [
"[(212,)]"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%sql\n",
"SELECT SUM(A.val * B.val)\n",
"FROM A, B\n",
"WHERE A.j = B.i AND A.i = 1 AND B.j = 2;"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Part (d): Matrix multiplication\n",
"\n",
"The product of a matrix $A$ (having dimensions $n\\times m$) and a matrix $B$ (having dimensions $m\\times p$) is the matrix $C$ (of dimension $n\\times p$) having cell at row $i$ and column $j$ equal to:\n",
"\n",
"$C_{ij} = \\sum_{k=1}^m A_{ik}B_{kj}$\n",
"\n",
"In other words, to multiply two matrices, get each cell of the resulting matrix $C$, $C_{ij}$, by taking the _dot product_ of the $i$th row of $A$ and the $j$th column of $B$.\n",
"\n",
"Write a single SQL query to get the matrix product of $A$ and $B$ (in the same format as $A$ and $B$).\n",
"\n",
"Write your query here:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Done.\n"
]
},
{
"data": {
"text/html": [
"\n",
" \n",
" i | \n",
" j | \n",
" val | \n",
"
\n",
" \n",
" 0 | \n",
" 0 | \n",
" 106 | \n",
"
\n",
" \n",
" 0 | \n",
" 1 | \n",
" 80 | \n",
"
\n",
" \n",
" 0 | \n",
" 2 | \n",
" 171 | \n",
"
\n",
" \n",
" 1 | \n",
" 0 | \n",
" 146 | \n",
"
\n",
" \n",
" 1 | \n",
" 1 | \n",
" 109 | \n",
"
\n",
" \n",
" 1 | \n",
" 2 | \n",
" 212 | \n",
"
\n",
" \n",
" 2 | \n",
" 0 | \n",
" 23 | \n",
"
\n",
" \n",
" 2 | \n",
" 1 | \n",
" 17 | \n",
"
\n",
" \n",
" 2 | \n",
" 2 | \n",
" 55 | \n",
"
\n",
"
"
],
"text/plain": [
"[(0, 0, 106),\n",
" (0, 1, 80),\n",
" (0, 2, 171),\n",
" (1, 0, 146),\n",
" (1, 1, 109),\n",
" (1, 2, 212),\n",
" (2, 0, 23),\n",
" (2, 1, 17),\n",
" (2, 2, 55)]"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%sql\n",
"SELECT A.i, B.j, SUM(A.val * B.val) AS val\n",
"FROM A,B\n",
"WHERE A.j = B.i\n",
"GROUP BY A.i, B.j;"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Problem 2: U.S. Hourly NOAA Precipitation dataset\n",
"----------------------------------------------\n",
"\n",
"We've prepared and loaded a [public dataset](https://catalog.data.gov/dataset/u-s-hourly-precipitation-data) from the US NOAA (National Oceanic and Atmospheric Administration) of daily precipitation data from different weather stations fom the month of Sep. 2013. We'll use the `precipitation` table here, which has a simplified schema:\n",
"\n",
"> `station_id INT`\n",
"\n",
"> `day INT`\n",
"\n",
"> `precipitation INT`"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Done.\n"
]
},
{
"data": {
"text/html": [
"\n",
" \n",
" station_id | \n",
" day | \n",
" precipitation | \n",
"
\n",
" \n",
" 16102 | \n",
" 1 | \n",
" 10 | \n",
"
\n",
" \n",
" 16102 | \n",
" 4 | \n",
" 10 | \n",
"
\n",
" \n",
" 16102 | \n",
" 24 | \n",
" 30 | \n",
"
\n",
"
"
],
"text/plain": [
"[(16102, 1, 10), (16102, 4, 10), (16102, 24, 30)]"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%sql SELECT * FROM precipitation LIMIT 3;"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Part (a): Precipitation Champions\n",
"\n",
"Using a _single SQL query_, find all of the stations that had the highest daily precipitation (across all stations) on any given day **more than once**, and return the counts of how many \"best days\" they had in descending order. Further requirements:\n",
"* Use `GROUP BY`\n",
"* Write the shortest possible SQL query to accomplish this\n",
"* Return relation `(station_id, num_best_days)`\n",
"\n",
"*Hint: Make sure your query correctly handles ties!*\n",
"\n",
"Write your query here:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Done.\n"
]
},
{
"data": {
"text/html": [
"\n",
" \n",
" station_id | \n",
" num_best_days | \n",
"
\n",
" \n",
" 335701 | \n",
" 6 | \n",
"
\n",
" \n",
" 92707 | \n",
" 3 | \n",
"
\n",
" \n",
" 225707 | \n",
" 2 | \n",
"
\n",
" \n",
" 458701 | \n",
" 2 | \n",
"
\n",
" \n",
" 611807 | \n",
" 2 | \n",
"
\n",
" \n",
" 734201 | \n",
" 2 | \n",
"
\n",
"
"
],
"text/plain": [
"[(335701, 6), (92707, 3), (225707, 2), (458701, 2), (611807, 2), (734201, 2)]"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%sql\n",
"SELECT station_id, COUNT(day) AS num_best_days\n",
"FROM \n",
" precipitation,\n",
" (SELECT day AS maxd, MAX(precipitation) AS maxp\n",
" FROM precipitation\n",
" GROUP BY day)\n",
"WHERE day = maxd AND precipitation = maxp\n",
"GROUP BY station_id\n",
"HAVING COUNT(day) > 1\n",
"ORDER BY num_best_days DESC;"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Part (b.i): Median value, Pt. I\n",
"\n",
"Our goal in this part is going to be to find the **median value** of a list of values. We want to do this for the general case, however we'll start with a slightly simplified setting:\n",
"\n",
"Write a _single SQL query_ to find the median value of a certain attribute in a table, given that:\n",
"* The table contains an odd number of rows\n",
"* The values of this attribute are unique in the table (e.g. no two rows have the same value for this attribute)\n",
"\n",
"Also, **do not hard code the size of the table and/or use any `ORDER BY` clause. Think about the definition of median!**\n",
"\n",
"Let's test using the table $X$, constructed out of the distinct precipitation values of all stations on all days, having one attribute $p$:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Done.\n",
"Done.\n",
"Done.\n"
]
},
{
"data": {
"text/html": [
"\n",
" \n",
" COUNT(*) | \n",
"
\n",
" \n",
" 59 | \n",
"
\n",
"
"
],
"text/plain": [
"[(59,)]"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%sql\n",
"DROP TABLE IF EXISTS X;\n",
"CREATE TABLE X AS SELECT DISTINCT precipitation AS p FROM precipitation;\n",
"SELECT COUNT(*) FROM X;"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Write your query here:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Done.\n"
]
},
{
"data": {
"text/html": [
"\n",
" \n",
" median | \n",
"
\n",
" \n",
" 51 | \n",
"
\n",
"
"
],
"text/plain": [
"[(51,)]"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%sql\n",
"SELECT x1.p AS median\n",
"FROM X AS x1\n",
"WHERE \n",
" (SELECT COUNT(*) FROM X AS x2 WHERE x2.p > x1.p)\n",
" =\n",
" (SELECT COUNT(*) FROM X AS x2 WHERE x2.p < x1.p);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Part (b.ii): Median value, Pt. II\n",
"\n",
"Now, we want to find the median precipitation value across all days for a station `376101`. **However to get credit, your query must work for ALL stations- test it on some others to be certain it's correct!**\n",
"\n",
"Again, do not hardcode the size of any table and/or use any `ORDER BY` clause.\n",
"\n",
"Note that now:\n",
"* The number of rows can be even\n",
"* The values can have duplicates\n",
"\n",
"Also note that we will use the definition of _median_ where ties (e.g. when there are an even number of rows) are broken by averaging. Write your query here:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Done.\n"
]
},
{
"data": {
"text/html": [
"\n",
" \n",
" median | \n",
"
\n",
" \n",
" 15.0 | \n",
"
\n",
"
"
],
"text/plain": [
"[(15.0,)]"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%sql\n",
"SELECT AVG(p) AS median FROM (\n",
" SELECT precipitation AS p \n",
" FROM precipitation AS p1\n",
" WHERE p1.station_id = 376101 AND\n",
" (SELECT COUNT(*) \n",
" FROM precipitation AS p2\n",
" WHERE p2.station_id = 376101 AND p1.precipitation >= p2.precipitation)\n",
" >=\n",
" (SELECT COUNT(*)\n",
" FROM precipitation AS p2\n",
" WHERE p2.station_id = 376101 AND p1.precipitation < p2.precipitation)\n",
" AND\n",
" (SELECT COUNT(*) \n",
" FROM precipitation AS p2\n",
" WHERE p2.station_id = 376101 AND p1.precipitation > p2.precipitation)\n",
" <=\n",
" (SELECT COUNT(*)\n",
" FROM precipitation AS p2\n",
" WHERE p2.station_id = 376101 AND p1.precipitation <= p2.precipitation)\n",
" GROUP BY p);"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Done.\n"
]
},
{
"data": {
"text/html": [
"\n",
" \n",
" precipitation | \n",
"
\n",
" \n",
" 10 | \n",
"
\n",
" \n",
" 10 | \n",
"
\n",
" \n",
" 50 | \n",
"
\n",
" \n",
" 40 | \n",
"
\n",
" \n",
" 30 | \n",
"
\n",
" \n",
" 10 | \n",
"
\n",
" \n",
" 10 | \n",
"
\n",
" \n",
" 230 | \n",
"
\n",
" \n",
" 20 | \n",
"
\n",
"
"
],
"text/plain": [
"[(10,), (10,), (50,), (40,), (30,), (10,), (10,), (230,), (20,)]"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%sql\n",
"SELECT p1.precipitation\n",
"FROM precipitation AS p1\n",
"WHERE p1.station_id = 376101 AND\n",
" (SELECT COUNT(*) \n",
" FROM precipitation AS p2\n",
" WHERE p2.station_id = 376101 AND p1.precipitation >= p2.precipitation)\n",
" >=\n",
" (SELECT COUNT(*)\n",
" FROM precipitation AS p2\n",
" WHERE p2.station_id = 376101 AND p1.precipitation < p2.precipitation)\n",
" AND\n",
" (SELECT COUNT(*) \n",
" FROM precipitation AS p2\n",
" WHERE p2.station_id = 376101 AND p1.precipitation > p2.precipitation)\n",
" <=\n",
" (SELECT COUNT(*)\n",
" FROM precipitation AS p2\n",
" WHERE p2.station_id = 376101 AND p1.precipitation <= p2.precipitation)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Part (c):\n",
"\n",
"Write a _single SQL query_ to find stations which had a _smallest rainy day precipitation value_ (i.e. smallest non-zero `precipitation`) that was **within 400** of the _largest overall precipitation value_ (across all stations & all days). Return tuples of type `(station_id, min_rainy_day_precip)`.\n",
"\n",
"*Note: do not hard-code the maximum daily precipitation value, or any other values.*\n",
"\n",
"**Do this using `GROUP BY` and aggregate functions (e.g. `COUNT`, `SUM`, `MAX`, `MIN`)**. Write your query here:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Done.\n"
]
},
{
"data": {
"text/html": [
"\n",
" \n",
" station_id | \n",
" min_daily | \n",
"
\n",
" \n",
" 88302 | \n",
" 190 | \n",
"
\n",
" \n",
" 100504 | \n",
" 90 | \n",
"
\n",
" \n",
" 127705 | \n",
" 70 | \n",
"
\n",
" \n",
" 146002 | \n",
" 90 | \n",
"
\n",
" \n",
" 196404 | \n",
" 65 | \n",
"
\n",
" \n",
" 229402 | \n",
" 70 | \n",
"
\n",
" \n",
" 293502 | \n",
" 120 | \n",
"
\n",
" \n",
" 303805 | \n",
" 120 | \n",
"
\n",
" \n",
" 357302 | \n",
" 100 | \n",
"
\n",
" \n",
" 393905 | \n",
" 70 | \n",
"
\n",
" \n",
" 471202 | \n",
" 60 | \n",
"
\n",
" \n",
" 538802 | \n",
" 74 | \n",
"
\n",
" \n",
" 652102 | \n",
" 80 | \n",
"
\n",
" \n",
" 696402 | \n",
" 100 | \n",
"
\n",
" \n",
" 782104 | \n",
" 66 | \n",
"
\n",
" \n",
" 985505 | \n",
" 90 | \n",
"
\n",
"
"
],
"text/plain": [
"[(88302, 190),\n",
" (100504, 90),\n",
" (127705, 70),\n",
" (146002, 90),\n",
" (196404, 65),\n",
" (229402, 70),\n",
" (293502, 120),\n",
" (303805, 120),\n",
" (357302, 100),\n",
" (393905, 70),\n",
" (471202, 60),\n",
" (538802, 74),\n",
" (652102, 80),\n",
" (696402, 100),\n",
" (782104, 66),\n",
" (985505, 90)]"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%sql\n",
" SELECT p1.station_id, p1.min_daily\n",
" FROM (\n",
" SELECT station_id, MIN(precipitation) AS min_daily\n",
" FROM precipitation\n",
" WHERE precipitation > 0\n",
" GROUP BY station_id) AS p1\n",
" WHERE p1.min_daily >= (\n",
" SELECT MAX(p2.precipitation) - 400\n",
" FROM precipitation AS p2);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Part (d):\n",
"\n",
"Do the same as above, except do **not** use `GROUP BY` or any aggregate functions. Write your query here:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Done.\n"
]
},
{
"data": {
"text/html": [
"\n",
" \n",
" station_id | \n",
" min_daily | \n",
"
\n",
" \n",
" 88302 | \n",
" 190 | \n",
"
\n",
" \n",
" 100504 | \n",
" 90 | \n",
"
\n",
" \n",
" 127705 | \n",
" 70 | \n",
"
\n",
" \n",
" 146002 | \n",
" 90 | \n",
"
\n",
" \n",
" 196404 | \n",
" 65 | \n",
"
\n",
" \n",
" 229402 | \n",
" 70 | \n",
"
\n",
" \n",
" 293502 | \n",
" 120 | \n",
"
\n",
" \n",
" 303805 | \n",
" 120 | \n",
"
\n",
" \n",
" 357302 | \n",
" 100 | \n",
"
\n",
" \n",
" 393905 | \n",
" 70 | \n",
"
\n",
" \n",
" 471202 | \n",
" 60 | \n",
"
\n",
" \n",
" 538802 | \n",
" 74 | \n",
"
\n",
" \n",
" 652102 | \n",
" 80 | \n",
"
\n",
" \n",
" 696402 | \n",
" 100 | \n",
"
\n",
" \n",
" 782104 | \n",
" 66 | \n",
"
\n",
" \n",
" 985505 | \n",
" 90 | \n",
"
\n",
"
"
],
"text/plain": [
"[(88302, 190),\n",
" (100504, 90),\n",
" (127705, 70),\n",
" (146002, 90),\n",
" (196404, 65),\n",
" (229402, 70),\n",
" (293502, 120),\n",
" (303805, 120),\n",
" (357302, 100),\n",
" (393905, 70),\n",
" (471202, 60),\n",
" (538802, 74),\n",
" (652102, 80),\n",
" (696402, 100),\n",
" (782104, 66),\n",
" (985505, 90)]"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%sql\n",
"SELECT station_id, min_daily\n",
"FROM (\n",
" SELECT station_id, precipitation AS min_daily\n",
" FROM precipitation AS p1\n",
" WHERE precipitation > 0\n",
" AND NOT EXISTS (\n",
" SELECT p2.precipitation \n",
" FROM precipitation AS p2 \n",
" WHERE p2.station_id = p1.station_id AND precipitation > 0\n",
" AND p2.precipitation < p1.precipitation)) AS p3\n",
"WHERE NOT EXISTS (\n",
" SELECT p4.precipitation\n",
" FROM precipitation AS p4\n",
" WHERE p4.precipitation - 400 > p3.min_daily);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Problem 3: The Traveling SQL Server Salesman Problem\n",
"--------------------------------------------------\n",
"\n",
"SQL Server salespeople are lucky as far as traveling salespeople go- they only have to sell one or two big enterprise contracts, at one or two offices in Wisconsin, in order to make their monthly quota!\n",
"\n",
"Answer the following questions using the table of streets connecting company office buildings.\n",
"\n",
"**Note that for convenience all streets are included _twice_, as $A \\rightarrow B$ and $B \\rightarrow A$. This should make some parts of the problem easier, but remember to take it into account!**"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Done.\n"
]
},
{
"data": {
"text/html": [
"\n",
" \n",
" id | \n",
" direction | \n",
" A | \n",
" B | \n",
" d | \n",
"
\n",
" \n",
" 0 | \n",
" F | \n",
" UW-Madison | \n",
" DooHickey Collective | \n",
" 7 | \n",
"
\n",
" \n",
" 0 | \n",
" R | \n",
" DooHickey Collective | \n",
" UW-Madison | \n",
" 7 | \n",
"
\n",
" \n",
" 1 | \n",
" F | \n",
" DooHickey Collective | \n",
" Gizmo Corp | \n",
" 2 | \n",
"
\n",
" \n",
" 1 | \n",
" R | \n",
" Gizmo Corp | \n",
" DooHickey Collective | \n",
" 2 | \n",
"
\n",
"
"
],
"text/plain": [
"[(0, u'F', u'UW-Madison', u'DooHickey Collective', 7),\n",
" (0, u'R', u'DooHickey Collective', u'UW-Madison', 7),\n",
" (1, u'F', u'DooHickey Collective', u'Gizmo Corp', 2),\n",
" (1, u'R', u'Gizmo Corp', u'DooHickey Collective', 2)]"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%sql SELECT * FROM streets LIMIT 4;"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Part (a): One-hop, two-hop, three-hop...\n",
"\n",
"Our salesperson has stopped at UW-Madison, to steal some cool new RDBMS technology from CS564-ers, and now wants to go sell it to a company _within 10 miles of UW-Madison and _passing through no more than 3 distinct streets_. Write a single query, not using `WITH` (see later on), to find all such companies.\n",
"\n",
"Your query should return tuples `(company, distance)` where distance is cumulative from UW-Madison.\n",
"\n",
"Write your query here:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Done.\n"
]
},
{
"data": {
"text/html": [
"\n",
" \n",
" company | \n",
" distance | \n",
"
\n",
" \n",
" DooHickey Collective | \n",
" 7 | \n",
"
\n",
" \n",
" DooHickey Corp | \n",
" 9 | \n",
"
\n",
" \n",
" Gadget Collective | \n",
" 9 | \n",
"
\n",
" \n",
" Gadget Corp | \n",
" 6 | \n",
"
\n",
" \n",
" Gizmo Corp | \n",
" 9 | \n",
"
\n",
" \n",
" Widget Industries | \n",
" 10 | \n",
"
\n",
"
"
],
"text/plain": [
"[(u'DooHickey Collective', 7),\n",
" (u'DooHickey Corp', 9),\n",
" (u'Gadget Collective', 9),\n",
" (u'Gadget Corp', 6),\n",
" (u'Gizmo Corp', 9),\n",
" (u'Widget Industries', 10)]"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%sql\n",
"SELECT B AS company, d AS distance FROM (\n",
" SELECT s1.A AS A, s1.B AS B, s1.d AS d\n",
" FROM streets s1 \n",
" UNION\n",
" SELECT s1.A AS A, s2.B AS B, s1.d + s2.d AS d\n",
" FROM streets s1, streets s2\n",
" WHERE s1.B = s2.A AND s2.B <> s1.A\n",
" UNION\n",
" SELECT s1.A AS A, s3.B AS B, s1.d + s2.d + s3.d AS d\n",
" FROM streets s1, streets s2, streets s3\n",
" WHERE s1.B = s2.A AND s2.B = s3.A\n",
" AND s2.B <> s1.A AND s3.B <> s2.A AND s3.B <> s1.A\n",
")\n",
"WHERE A = 'UW-Madison' AND d <= 10;"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Part (b): A stop at the Farm\n",
"\n",
"Now, our salesperson is out in the field, and wants to see all routes- and their distances- which will take him/her from a company $A$ to a company $B$, with the following constraints:\n",
"* The route must _pass through UW-Madison (in order to pick up new RDBMS tech to sell!)\n",
"* $A$ and $B$ must _each individually_ be _within 3 hops of UW-Madison\n",
"* $A$ and $B$ must be different companies\n",
"* _The total distance must be $<= 15$_\n",
"* Do not use `WITH`\n",
"* If you return a path $A \\rightarrow B$, _also include $B \\rightarrow A$ in your answer_\n",
"\n",
"In order to make your answer a bit cleaner, you may split into two queries, one of which creates a `VIEW`. A view is a virtual table based on the output set of a SQL query. A view can be used just like a normal table- the only difference under the hood is that the DBMS re-evaluates the query used to generate it each time a view is queried by a user (thus the data is always up-to date!)\n",
"\n",
"Here's a simple example of a view:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Done.\n",
"Done.\n",
"Done.\n"
]
},
{
"data": {
"text/html": [
"\n",
" \n",
" A | \n",
" B | \n",
" d | \n",
"
\n",
" \n",
" DooHickey Collective | \n",
" Gizmo Corp | \n",
" 2 | \n",
"
\n",
" \n",
" Gizmo Corp | \n",
" DooHickey Collective | \n",
" 2 | \n",
"
\n",
" \n",
" Gizmo Corp | \n",
" Widget Industries | \n",
" 1 | \n",
"
\n",
"
"
],
"text/plain": [
"[(u'DooHickey Collective', u'Gizmo Corp', 2),\n",
" (u'Gizmo Corp', u'DooHickey Collective', 2),\n",
" (u'Gizmo Corp', u'Widget Industries', 1)]"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%sql \n",
"DROP VIEW IF EXISTS short_streets;\n",
"CREATE VIEW short_streets AS \n",
"SELECT A, B, d FROM streets WHERE d < 3;\n",
"SELECT * FROM short_streets LIMIT 3;"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Write your query or queries here:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Done.\n",
"Done.\n",
"Done.\n"
]
},
{
"data": {
"text/html": [
"\n",
" \n",
" company_1 | \n",
" company_2 | \n",
" distance | \n",
"
\n",
" \n",
" DooHickey Collective | \n",
" Gadget Corp | \n",
" 13 | \n",
"
\n",
" \n",
" DooHickey Corp | \n",
" Gadget Corp | \n",
" 15 | \n",
"
\n",
" \n",
" Gadget Collective | \n",
" Gadget Corp | \n",
" 15 | \n",
"
\n",
" \n",
" Gadget Corp | \n",
" DooHickey Collective | \n",
" 13 | \n",
"
\n",
" \n",
" Gadget Corp | \n",
" DooHickey Corp | \n",
" 15 | \n",
"
\n",
" \n",
" Gadget Corp | \n",
" Gadget Collective | \n",
" 15 | \n",
"
\n",
" \n",
" Gadget Corp | \n",
" Gizmo Corp | \n",
" 15 | \n",
"
\n",
" \n",
" Gizmo Corp | \n",
" Gadget Corp | \n",
" 15 | \n",
"
\n",
"
"
],
"text/plain": [
"[(u'DooHickey Collective', u'Gadget Corp', 13),\n",
" (u'DooHickey Corp', u'Gadget Corp', 15),\n",
" (u'Gadget Collective', u'Gadget Corp', 15),\n",
" (u'Gadget Corp', u'DooHickey Collective', 13),\n",
" (u'Gadget Corp', u'DooHickey Corp', 15),\n",
" (u'Gadget Corp', u'Gadget Collective', 15),\n",
" (u'Gadget Corp', u'Gizmo Corp', 15),\n",
" (u'Gizmo Corp', u'Gadget Corp', 15)]"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%sql\n",
"DROP VIEW IF EXISTS non_cyclic_distances;\n",
"CREATE VIEW non_cyclic_distances AS\n",
"SELECT s1.A, s1.B, s1.d AS dist\n",
"FROM streets s1\n",
" UNION\n",
"SELECT s1.A, s2.B, s1.d + s2.d AS dist\n",
"FROM streets s1, streets s2\n",
"WHERE s1.B = s2.A\n",
" UNION\n",
"SELECT s1.A, s3.B, s1.d + s2.d + s3.d AS dist\n",
"FROM streets s1, streets s2, streets s3\n",
"WHERE s1.B = s2.A AND s2.B = s3.A AND s3.B <> s1.A;\n",
"\n",
"SELECT d1.B AS company_1, d2.B AS company_2, MIN(d1.dist + d2.dist) as distance\n",
"FROM non_cyclic_distances d1, non_cyclic_distances d2\n",
"WHERE d1.A = 'UW-Madison' AND d2.A = 'UW-Madison' AND d1.B <> d2.B\n",
"GROUP BY d1.B, d2.B\n",
"HAVING distance <= 15;"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Part (c): Ensuring acyclicity\n",
"\n",
"You may have noticed at this point that the network, or **_graph_**, of streets in our `streets` table seems like it might be a **tree**.\n",
"\n",
"Recall that a **_tree_** is an undirected graph where each node has exactly one parent (or, is the root, and has none), but may have multiple children. A slightly more formal definition of a tree is as follows: \n",
"\n",
"> An undirected graph $T$ is a **_tree_** if it is **connected**- meaning that there is a path between every pair of nodes- and has no **cycles** (informally, closed loops)\n",
"\n",
"Suppose that we guarantee the following about the graph of companies & streets determined by the `streets` table:\n",
"* It is _connected_\n",
"* It has no cycles of length $> 3$\n",
"\n",
"Write a _single SQL query_ which, if our graph is not a tree (i.e. if this isn't a [shaggy dog problem](https://en.wikipedia.org/wiki/Shaggy_dog_story)), **turns it into a tree** by deleting exactly _one_ street (*= two entries in our table!*). Write your query here:"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2 rows affected.\n"
]
},
{
"data": {
"text/plain": [
"[]"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%sql\n",
"DELETE FROM streets\n",
"WHERE id IN\n",
"(\n",
" SELECT s1.id\n",
" FROM streets s1, streets s2, streets s3\n",
" WHERE s1.B = s2.A AND s2.B = s3.A AND s3.B = s1.A\n",
" LIMIT 1\n",
");"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Part (d): The longest path\n",
"\n",
"In this part, we want to find the distance of the _longest path between any two companies_.\n",
"\n",
"Note that you should probably have done Part (c) first so that the graph of streets is a _tree_- this will make it much easier to work with!\n",
"\n",
"If you've done the other parts above, you might be skeptical that SQL can find paths of arbitrary lengths (which is what we need to do for this problem); how can we do this?\n",
"\n",
"There are some non-standard SQL functions- i.e. not universally supported by all SQL DBMSs- which are often incredibly useful. One of these is the `WITH RECURSIVE` clause, supported by SQLite.\n",
"\n",
"A `WITH` clause lets you define what is essentially a view within a clause, mainly to clean up your SQL code. A trivial example, to illustrate `WITH`:"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Done.\n"
]
},
{
"data": {
"text/html": [
"\n",
" \n",
" name | \n",
"
\n",
" \n",
" Gizmo Corp | \n",
"
\n",
" \n",
" GizmoWorks | \n",
"
\n",
" \n",
" Gizmo Industries | \n",
"
\n",
" \n",
" Gizmo Collective | \n",
"
\n",
" \n",
" GizmoCo | \n",
"
\n",
"
"
],
"text/plain": [
"[(u'Gizmo Corp',),\n",
" (u'GizmoWorks',),\n",
" (u'Gizmo Industries',),\n",
" (u'Gizmo Collective',),\n",
" (u'GizmoCo',)]"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%sql\n",
"WITH companies(name) AS (\n",
" SELECT DISTINCT A FROM streets)\n",
"SELECT * \n",
"FROM companies \n",
"WHERE name LIKE '%Gizmo%';"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There is also a recursive variant, `WITH RECURSIVE`. `WITH RECURSIVE` allows you to define a view just as above, except the table can be defined recursively. A `WITH RECURSIVE` clause must be of the following form:\n",
"\n",
"```\n",
"WITH RECURSIVE \n",
" R(...) AS (\n",
" SELECT ... \n",
" UNION [ALL] \n",
" SELECT ... FROM R, ...\n",
" )\n",
"...\n",
"```\n",
"\n",
"`R` is the _recursive table_. The `AS` clause contains two `SELECT` statements, conjoined by a `UNION` or `UNION ALL`; the first `SELECT` statement provides the initial / base case values, and the second or _recursive_ `SELECT` statement must include the recursive table in its `FROM` clause.\n",
"\n",
"Basically, the recursive `SELECT` statement continues executing on the tuple _most recently inserted into `R`_, inserting output rows back into `R`, and proceeding recursively in this way, until it no longer outputs any rows, and then stops. See the [SQLite documentation](https://www.sqlite.org/lang_with.html) for more detail.\n",
"\n",
"The following example computes $5! = 5*4*3*2*1$ using `WITH RECURSIVE`:"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Done.\n"
]
},
{
"data": {
"text/html": [
"\n",
" \n",
" x | \n",
"
\n",
" \n",
" 120 | \n",
"
\n",
"
"
],
"text/plain": [
"[(120,)]"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%sql\n",
"WITH RECURSIVE\n",
" factorials(n,x) AS (\n",
" SELECT 1, 1\n",
" UNION\n",
" SELECT n+1, (n+1)*x FROM factorials WHERE n < 5)\n",
"SELECT x FROM factorials WHERE n = 5;"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this example:\n",
"\n",
"1. First, `(1,1)` is inserted into the table `factorials` (the base case).\n",
"2. Next, this tuple is returned by the recursive select, as `(n, x)`, and we insert the result back into `factorials`: `(1+1, (1+1)*1) = (2,2)`\n",
"3. Next, we do the same with the last tuple inserted into `factorials`- `(2,2)`- and insert `(2+1, (2+1)*2) = (3,6)`\n",
"4. And again: get `(3,6)` from `factorials` and insert `(3+1, (3+1)*6) = (4,24)` back in\n",
"5. And again: `(4,24)` -> `(4+1, (4+1)*24) = (5,120)`\n",
"6. Now the last tuple inserted into `factorials` is `(5, 120)`, which fails the `WHERE n < 5` clause, and thus our recursive select returns an empty set, and our `WITH RECURSIVE` statement concludes\n",
"7. Finally, now that our `WITH RECURSIVE` clause has concluded, we move on to the `SELECT x FROM factorials WHERE n=5` clause, which gets us our answer!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Now, your turn!\n",
"\n",
"Write a single SQL query that uses `WITH RECURSIVE` to find the furthest (by distance) pair of companies that still have a path of streets connecting them. Your query should return `(A, B, distance)`.\n",
"\n",
"Write your query here:"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Done.\n"
]
},
{
"data": {
"text/html": [
"\n",
" \n",
" A | \n",
" B | \n",
" distance | \n",
"
\n",
" \n",
" GadgetWorks | \n",
" ThingWorks | \n",
" 63 | \n",
"
\n",
"
"
],
"text/plain": [
"[(u'GadgetWorks', u'ThingWorks', 63)]"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%sql\n",
"WITH RECURSIVE\n",
" paths(a, b, b_prev, d) AS (\n",
" SELECT A, B, A, d FROM streets\n",
" UNION ALL\n",
" SELECT p.a, s.B, s.A, p.d + s.d\n",
" FROM paths p, streets s\n",
" WHERE p.b = s.A AND s.B <> p.a AND s.B <> p.b_prev)\n",
"SELECT a AS A, b AS B, MAX(d) AS distance FROM paths;"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Note on alternate output\n",
"\n",
"**NOTE:** The **_distance_** of the longest path could be **49 _OR_ 63** depending on which street you deleted in Part (c)!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Bonus Problem: The longest path, Pt. II\n",
"\n",
"Using `WITH RECURSIVE` may be a little tricky syntactically, but it is quite elegant. What would alternatives look like? We already know we can't do it with non-recursive SQL.\n",
"\n",
"For this problem, **use SQL _and_ Python** to find the longest path (by cumulative distance) between two companies in the streets graph, **without using a `WITH` clause**, again returning a single tuple of the form `(A, B, distance)`.\n",
"\n",
"_Note: Be careful of trivial cycles in the graph! Especially if you write recursive functions in your python code, note that IPython handles hitting the max recursion depth pretty poorly (i.e. crashes / freezes up). You can debug in a normal terminal first if this is a concern._\n",
"\n",
"Write your code / queries here:"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Done.\n",
"Done.\n",
"(u'ThingWorks', u'GadgetWorks', 63)\n"
]
}
],
"source": [
"from collections import defaultdict\n",
"\n",
"# Get the graph\n",
"streets = %sql SELECT * FROM streets;\n",
"companies = %sql SELECT DISTINCT A FROM streets;\n",
"edges = defaultdict(list)\n",
"for s in streets:\n",
" edges[s.A].append((s.B, s.d))\n",
"companies = [c[0] for c in companies]\n",
"\n",
"def longest_path_from(a, path_to=None):\n",
" if path_to is None:\n",
" path_to = [a]\n",
" if len(edges[a]) == 0:\n",
" return 0, []\n",
" longest_paths = []\n",
" for b, dist in edges[a]:\n",
" if b not in path_to:\n",
" d, p = longest_path_from(b, path_to + [b])\n",
" longest_paths.append((dist + d, [a,b] + p))\n",
" longest_paths.sort(reverse=True)\n",
" if len(longest_paths) > 0:\n",
" return longest_paths[0]\n",
" else:\n",
" return 0, []\n",
"\n",
"# Find the longest path exhaustively... find the longest path from each node\n",
"longest_paths = []\n",
"for a in companies:\n",
" d, p = longest_path_from(a)\n",
" longest_paths.append((d, p))\n",
"longest_paths.sort(reverse=True)\n",
"lp = longest_paths[0]\n",
"print (lp[1][0], lp[1][-1], lp[0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Note on alternate output\n",
"\n",
"**NOTE:** The **_distance_** of the longest path could be **49 _OR_ 63** depending on which street you deleted in Part (c)!"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.13"
}
},
"nbformat": 4,
"nbformat_minor": 2
}