Data Science with SQL in Python -VroyCLe™

Vaibhav Ghadge ,14 Oct 2019

There are many ways to use SQL in Python. Multiple libraries have been developed for this purpose that can be utilized & can provide a plethora of options for any programmer. When I had created a Flask application for a full-fledged web developed website, I had used SQLite. MySQL & SQL are popular examples of these libraries but since we’re using Python for this, we will use the sqlalchemy engine! Let’s start by installing the SQL package so Python can utilize it.

!pip install sqlalchemy

The following code below will return all entities (columns) from the table & then save the response into a data frame, displaying the head.

# Example 
sql = """
SELECT *
FROM table
"""
dataframe = pd.read_sql_query(sql, engine)
dataframe.head()

If you want to, you could also pull specific columns from the table, using this code instead:

# Example
SELECT entity_1, entity_2, entity_3
FROM table

If you are dealing with multiple tables, you may need to specify which entity from which table. The entities we want may come from different tables in the database. In this case, we will focus on a single table however you can use name spacing syntax in your SQL statement to do this.

# Example
SELECT table.entity_1, table.entity_2, table.entity_3
FROM table

You can also assign aliases to each entity name or table name for simplification or readability purposes. According to W3 Schools, SQL aliases are used to give a table, or a column in a table, a temporary name. Aliases are often used to make column names more readable. An alias only exists for the duration of the query.

# Example
SELECT t.entity_1 AS name, t.entity_2 AS id
FROM table AS t

If you want to get the distinct rows from a column, you can send this SQL statement:

# Example
SELECT DISTINCT entity_1
FROM table

If you want to order your data by a specific column (or multiple columns) you can use order by & specify if you want ASC (ascending) or DESC (descending) order. Remember, if you use multiple columns in order by, the order in which SQL orders the data will be from left to right.

# Example
SELECT entity_1, entity_2, entity_3
FROM table
ORDER BY entity_1 DESC, entity_3 ASC

If you want to include a condition for the query, you can use where. You can either use boolean conditions or wildcards for string entities.

# Example
SELECT *
FROM table
WHERE entity_1 > 5# Example 2
WHERE entity_1 BETWEEN 5 AND 10# Example 3
WHERE entity_1 > 5 AND entity_1 < 10

If you want to deal with NULL values or ‘NaN’ values, you can use the following code:

# Example
SELECT *
FROM table
WHERE entity_1 IS NULL# Example 2
WHERE entity_1 IS NOT NULL

Often you need to aggregate the data, group the data & apply conditions to the aggregated data. Here are ways of doing this!

# Example 
SELECT SUM(entity_1) AS sum, entity_2
FROM table# Example 2 
# Use aggregates; you should use HAVING instead of WHERE
SELECT SUM(entity_1) AS sum, entity_2
FROM table
HAVING entity_2 BETWEEN 5 AND 10# Example 3
# To group your data by specific entity
SELECT entity_1, SUM(entity_2) AS sum
FROM table
GROUP BY entity_1
HAVING entity_3 BETWEEN 5 AND 10

Joining Tables in SQL

When querying data from multiple tables, you need to join these tables. There are multiple ways to join tables in SQL. This figure illustrates these joins. You will likely work with inner joins more often. However, it is important to understand what each type of join does. Joining tables can only be done when there is a common entity between the two tables, and you need to define that relationship using on.

# Example
SELECT t.entity_1, t.entity_2
FROM table AS t
INNER JOIN table_2 AS t_2 ON t_2.key = t.key

Now that we understand the basics of Sequel, let’s apply this to a data science problem using Python, mainly the sqlalchemy library we installed! I will provide the question/example, SQL syntax followed by an image of the output. If you want more information & a thorough explanation for SQL syntax & concepts, check out the W3Schools link.

For the sake of this article, we will be using data that’s already been stored online. To better understand the rest of this guide, you will need:

A Python IDE (I did this in Jupyter Notebook)
Somewhat of an understanding of SQL syntax & concepts

# Imports
import pandas as pd
from sqlalchemy import create_engineengine = create_engine('postgres://DSI8:p3c1341f0f241ef04befd0f0b3acc0365eb30839b9408f9a9b3278d96966f34b8@ec2-34-202-213-35.compute-1.amazonaws.com:5432/dc5rooirs71hh0')

Select & From

SQL is used to pull specific data from a database. The two primary clauses that must be present in every query are SELECT& FROM.

SELECT allows you to select a subset of columns from a table
FROM: Since there are often many tables in a database, it's important to specify which table you're querying.

sql = """
SELECT *
FROM restaurants
"""
df = pd.read_sql_query(sql, engine)
df.head()

Ex 1: Let’s view the name & calories columns from the foods table.

sql = """
SELECT name, calories 
FROM foods;
"""
df = pd.read_sql_query(sql, engine)
df.head()

Ex 2: Let’s retrieve the names of all the categories.

sql = """
SELECT name
FROM categories
"""
df = pd.read_sql_query(sql, engine)
df.head()

Name Spacing

Sometimes you see the columns prefixed by their corresponding table. This is overkill when you’re just querying, one table, but becomes important once you query from multiple tables.

Ex 1: Let’s show the name&calories columns from the foods table

sql = """
SELECT foods.name, foods.calories
FROM foods
"""
df = pd.read_sql_query(sql, engine)
df.head()

Ex 2: Give me the names of all the categories using name spacing.

sql = """
SELECT categories.name
FROM categories
"""
df = pd.read_sql_query(sql, engine)
df.head()

Ex 3: I want every column from the foods table. Using name spacing, how can we apply a wildcard?

sql = """
SELECT foods.*
FROM foods
"""
df = pd.read_sql_query(sql, engine)
df.head()

Aliasing

Writing out the same table can get pretty cumbersome. Thankfully we can give our tables a temporary (& hopefully shorter) name. This is called aliasing.

Ex 1: How can we query the name& calories columns from the foods table, & rename the foods table to f?

sql = """
SELECT f.*
FROM foods AS f
"""
df = pd.read_sql_query(sql, engine)
df.head()

Ex 2: Let’s retrieve the names of all the restaurants, & alias the table?

sql = """
SELECT r.name
FROM restaurants AS r
"""
df = pd.read_sql_query(sql, engine)
df.head()

Recall that the foods, restaurant&categories tables all have a name column. When we start combining tables into a single query, we might want to give each name column an alias as well.

Ex 3: Let’s show the name (temporarily renamed to food) &calories columns from the foods table.

sql = """
SELECT f.name AS food, f.calories
FROM foods AS f
"""
df = pd.read_sql_query(sql, engine)
df.head()

Ex 4: Give all the category names, rename the name column to category.

sql = """
SELECT c.name AS category
FROM categories AS c
"""
df = pd.read_sql_query(sql, engine)
df.head()

Distinct

DISTINCT returns a list of unique values from a given column.

Ex 1: Show the unique names (renamed to food) from the foods table.

sql = """
SELECT DISTINCT f.name AS food
FROM foods AS f
"""
df = pd.read_sql_query(sql, engine)
df.head()

Ex 2: From the foods table, query the unique values from the restaurant_id column.

sql = """
SELECT DISTINCT f.restaurant_id
FROM foods AS f
"""
df = pd.read_sql_query(sql, engine)
df.head()

Order By

Sometimes it makes sense to order your query on a certain column. For example, we might want to get a list of users sorted alphabetically. You can order on multiple columns. Priority is given from left to right in your ORDER BY clause.

Ex 1: Show the name (temporarily renamed to food)&calories columns from the foods table, ordered from most caloric to least.

sql = """
SELECT f.name AS foods, f.calories
FROM foods AS f
ORDER BY f.calories DESC
"""
df = pd.read_sql_query(sql, engine)
df.head()

Ex 2: Give all the categories in reverse alphabetical order.

sql = """
SELECT c.name
FROM categories c
ORDER BY c.name DESC
"""
df = pd.read_sql_query(sql, engine)
df.head()

Ex 3: Retrieve all the columns from foods, from most carbs to least

sql = """
SELECT f.*
FROM foods f
ORDER BY f.carbs DESC
"""
df = pd.read_sql_query(sql, engine)
df.head()

Ex 4: What food has the most fat?

sql = """
SELECT f.*
FROM foods f
ORDER BY f.fat DESC
"""
df = pd.read_sql_query(sql, engine)
df.head()

Ex 5: Give the restaurant_id (renamed to rid), name&calories from foods. Order first by restaurant_id from smallest to biggest, then by calories from biggest to smallest.

sql = """
SELECT f.restaurant_id AS rid, f.name, f.calories
FROM foods f
ORDER BY f.restaurant_id ASC, f.calories DESC
"""
df = pd.read_sql_query(sql, engine)
df.head()

Ex 6: Give the calories&name from the foods table. Ordered first by calories in descending order, then by name in alphabetical order.

sql = """
SELECT f.name, f.calories
FROM foods f
ORDER BY f.calories DESC, f.name ASC
"""
df = pd.read_sql_query(sql, engine)
df.head()

Limit

Rather than returning ALL rows from a given table, you might only want a subset. This can be achieved with the LIMIT command.

Ex 1: Give the name&calorie of the 20 most caloric items from the foods table.

sql = """
SELECT f.name, f.calories
FROM foods f
ORDER BY f.calories DESC
LIMIT 20
"""
df = pd.read_sql_query(sql, engine)
df.head()

Ex 2: What are the top ten most fatty foods?

sql = """
SELECT f.*
FROM foods f
ORDER BY f.fat DESC
LIMIT 10
"""
df = pd.read_sql_query(sql, engine)
df.head()

Where

One of the more important skills in SQL is the ability to filter your queries based a certain condition. This is accomplished with the WHERE command.

Numerical filters are similar to python:

“Greater than”: > AND “Less than”: <
“Greater than or equal to”: >= AND “Less than or equal to”: <=
“Equal to”: =

Ex 1: Give the name&calories of foods with more than 1,000 calories.

sql = """
SELECT f.name, f.calories
FROM foods f
WHERE f.calories > 1000
ORDER BY f.calories
"""
df = pd.read_sql_query(sql, engine)
df.head()

Ex 2: What are the top 10 least caloric foods over 100 calories?

sql = """
SELECT f.name, f.calories
FROM foods f
WHERE f.calories > 100
ORDER BY f.calories ASC
LIMIT 10
"""
df = pd.read_sql_query(sql, engine)
df.head()

And/Or

Multiple WHERE clauses can be chained together with AND&OR, similar to chaining multiple boolean expressions in Python.

Ex 1: Give the name, calories&carbs from all foods over 1,000 calories or having at least 30g of carbs.

sql = """
SELECT f.name, f.calories, f.carbs
FROM foods f
WHERE f.calories > 1000
OR f.carbs >= 30
"""
df = pd.read_sql_query(sql, engine)
df.head()

Ex 2: Give all foods from Jimmy Johns (restaurant_id = 27) with 700 or more calories.

sql = """
SELECT f.*
FROM foods f
WHERE f.restaurant_id = 27
AND f.calories >= 700
"""
df = pd.read_sql_query(sql, engine)
df.head()

Between

You can also filter numerically based on a range with BETWEEN.

Ex 1: Give the name&calories of foods that are between 0 & 10 calories (inclusive).

sql = """
SELECT f.name, f.calories
FROM foods f
WHERE f.calories BETWEEN 0 AND 10
"""
df = pd.read_sql_query(sql, engine)
df.head()

Joins

It’s very common to want to combine information from multiple tables into one query. For instance, we might want to find the user’s name for a given blog post. We can do this by utilizing joins. There are several types of joins:

Inner join:
Left/Right join:
Left/Right outer join:
Unions:

*NOTE*: The most common join is the inner join. Joins are the most common interview question when asked about knowledge of SQL! This is why I wrote this article & decided to focus on this section the most for this scenario.

Ex 1: Give the names of every food item with their associated restaurant. Be sure to alias the columns.

sql = """
SELECT f.name AS food, r.name AS restaurant
FROM foods f
INNER JOIN restaurants r ON r.id = f.restaurant_id
"""
df = pd.read_sql_query(sql, engine)
df.head()

Ex 2: Give the names of every food item from Burger King.

sql = """
SELECT f.*, r.name AS restaurant
FROM foods f
INNER JOIN restaurants r ON r.id = f.restaurant_id
WHERE r.name = 'Burger King'
"""
df = pd.read_sql_query(sql, engine)
df.head()

Ex 3: Give the names of every food item with their associated category.

sql = """
SELECT f.name AS food, c.name AS category
FROM foods f
INNER JOIN categories_foods cf ON cf.food_id = f.id
INNER JOIN categories c ON c.id = cf.category_id
"""
df = pd.read_sql_query(sql, engine)
df.head()

Ex 4: Give the name, restaurant, category&calories of every food item. Sorted alphabetically by restaurant, then by calories in descending order.

sql = """
SELECT f.name AS food, f.calories, c.name AS category, r.name AS restaurant
FROM foods f
INNER JOIN categories_foods cf ON cf.food_id = f.id
INNER JOIN categories c ON c.id = cf.category_id
INNER JOIN restaurants r ON r.id = f.restaurant_id
ORDER BY r.name, f.calories DESC
"""
df = pd.read_sql_query(sql, engine)
df.head()

Ex 5: Give the names of every food item in the Desserts category.

sql = """
SELECT f.name AS food, c.name AS category
FROM foods f
INNER JOIN categories_foods cf ON cf.food_id = f.id
INNER JOIN categories c ON c.id = cf.category_id
WHERE c.name = 'Desserts'
"""
df = pd.read_sql_query(sql, engine)
df.head()

Ex 6: Give the name, restaurant&calories of the top ten most caloric "Kid's Meals" (this is a category).

sql = """
SELECT f.name AS food, f.calories, c.name AS category, r.name AS restaurant
FROM foods f
INNER JOIN categories_foods cf ON cf.food_id = f.id
INNER JOIN categories c ON c.id = cf.category_id
INNER JOIN restaurants r ON r.id = f.restaurant_id
WHERE c.name = 'Kid''s Meals'
ORDER BY f.calories DESC
LIMIT 10
"""
df = pd.read_sql_query(sql, engine)
df.head()

Aggregating

Sometimes we might want to reduce our query to a single value. For example, we may want to know how many users our in our database. The COUNT in the previous query is an aggregate function. The most common aggregate functions are:COUNT, AVG, MIN, MAX, SUM

Ex 1: What’s total number of calories in the entire foods table?

sql = """
SELECT SUM(f.calories)
FROM foods f
"""
df = pd.read_sql_query(sql, engine)
df.head()

Group By

Often we’ll want to group our data into buckets and then run some sort of aggregate function. Every column you’re returning that isn’t being aggregated needs to be in the GROUP BY clause.

Ex 1: Return the average number of calories for each restaurant, sorted alphabetically.

sql = """
SELECT r.name, AVG(f.calories)
FROM foods f
INNER JOIN restaurants r ON r.id = f.restaurant_id
GROUP BY r.name
ORDER BY r.name
"""
df = pd.read_sql_query(sql, engine)
df.head()

Having

Sometimes you might want to use the result of an aggregate function as a filter. We can do this with HAVING, which is similar to WHERE but for aggregates.

Ex 1: Give the name& average calories for all restaurants with more than 700 calories per menu item on average.

sql = """
SELECT r.name, AVG(f.calories)
FROM foods f
INNER JOIN restaurants r ON r.id = f.restaurant_id
GROUP BY r.name
HAVING AVG(f.calories) > 700
ORDER BY r.name
"""
df = pd.read_sql_query(sql, engine)
df.head()

Ex 2: What categories average more than 750 calories per item? Order your results from most caloric to least.

sql = """
SELECT c.name, AVG(f.calories)
FROM foods f
INNER JOIN categories_foods cf ON cf.food_id = f.id
INNER JOIN categories c ON c.id = cf.category_id
GROUP BY c.name
HAVING AVG(f.calories) > 750
ORDER BY AVG(f.calories) DESC
"""
df = pd.read_sql_query(sql, engine)
df.head()

Wild Cards

We can use LIKE and wildcards (%) to broaden our string filters. Note:

ILIKE is a case insensitive LIKE.
You can negate a LIKE with NOT LIKE.
In this lecture, all wildcards need to be double parentheses (%%).

Ex 1: Find all the Whoppers in the foods table.

sql = """
SELECT f.*
FROM foods f
WHERE f.name ILIKE '%%Whopper%%'
"""
df = pd.read_sql_query(sql, engine)
df.head()

Null Values

Databases can have null values (our equivalent to NaN).

Ex 1: Give me everything from the foods table with null values in the calories column.

sql = """
SELECT f.*
FROM foods f
WHERE f.calories IS NULL
"""
df = pd.read_sql_query(sql, engine)
df.head()

Thank you for reading this article taking interest in Sequel application within Python to be able to solve data science concepts. This article will be updated regularly as well. What we didn’t cover is how to apply the data science concepts such as cleaning, exploratory data analysis, munging, feature engineering & machine learning modeling. We can apply this instantly after retrieving the data! We use Pandas Python library however to view the data. We use Pandas frequently in data science.

Click here for more-->>

VroyCLe™

Search This Blog

VroyCLe™ Featured Post

संभाजी - विश्वास पाटील. संपूर्ण कादंबरी