- Get link
- X
- Other Apps
VroyCLe™ Featured Post
Posted by
VroyCLe Vision
- Get link
- X
- Other Apps
Data Science with SQL in Python -VroyCLe™
Vaibhav Ghadge ,14 Oct 2019
There are many ways to use SQL in Python. Multiple libraries have been developed for this purpose that can be utilized & can provide a plethora of options for any programmer. When I had created a Flask application for a full-fledged web developed website, I had used SQLite. MySQL & SQL are popular examples of these libraries but since we’re using Python for this, we will use the sqlalchemy engine! Let’s start by installing the SQL package so Python can utilize it.
!pip install sqlalchemy
The following code below will return all entities (columns) from the table & then save the response into a data frame, displaying the head.
# Example
sql = """
SELECT *
FROM table
"""
dataframe = pd.read_sql_query(sql, engine)
dataframe.head()
If you want to, you could also pull specific columns from the table, using this code instead:
# Example
SELECT entity_1, entity_2, entity_3
FROM table
If you are dealing with multiple tables, you may need to specify which entity from which table. The entities we want may come from different tables in the database. In this case, we will focus on a single table however you can use name spacing syntax in your SQL statement to do this.
# Example
SELECT table.entity_1, table.entity_2, table.entity_3
FROM table
You can also assign aliases to each entity name or table name for simplification or readability purposes. According to W3 Schools, SQL aliases are used to give a table, or a column in a table, a temporary name. Aliases are often used to make column names more readable. An alias only exists for the duration of the query.
# Example
SELECT t.entity_1 AS name, t.entity_2 AS id
FROM table AS t
If you want to get the distinct rows from a column, you can send this SQL statement:
# Example
SELECT DISTINCT entity_1
FROM table
If you want to order your data by a specific column (or multiple columns) you can use order by & specify if you want ASC (ascending) or DESC (descending) order. Remember, if you use multiple columns in order by, the order in which SQL orders the data will be from left to right.
# Example
SELECT entity_1, entity_2, entity_3
FROM table
ORDER BY entity_1 DESC, entity_3 ASC
If you want to include a condition for the query, you can use where. You can either use boolean conditions or wildcards for string entities.
# Example SELECT * FROM table WHERE entity_1 > 5# Example 2 WHERE entity_1 BETWEEN 5 AND 10# Example 3 WHERE entity_1 > 5 AND entity_1 < 10
If you want to deal with NULL values or ‘NaN’ values, you can use the following code:
# Example SELECT * FROM table WHERE entity_1 IS NULL# Example 2 WHERE entity_1 IS NOT NULL
Often you need to aggregate the data, group the data & apply conditions to the aggregated data. Here are ways of doing this!
# Example SELECT SUM(entity_1) AS sum, entity_2 FROM table# Example 2 # Use aggregates; you should use HAVING instead of WHERE SELECT SUM(entity_1) AS sum, entity_2 FROM table HAVING entity_2 BETWEEN 5 AND 10# Example 3 # To group your data by specific entity SELECT entity_1, SUM(entity_2) AS sum FROM table GROUP BY entity_1 HAVING entity_3 BETWEEN 5 AND 10
Joining Tables in SQL
# Example
SELECT t.entity_1, t.entity_2
FROM table AS t
INNER JOIN table_2 AS t_2 ON t_2.key = t.key
Now that we understand the basics of Sequel, let’s apply this to a data science problem using Python, mainly the sqlalchemy library we installed! I will provide the question/example, SQL syntax followed by an image of the output. If you want more information & a thorough explanation for SQL syntax & concepts, check out the W3Schools link.
For the sake of this article, we will be using data that’s already been stored online. To better understand the rest of this guide, you will need:
- A Python IDE (I did this in Jupyter Notebook)
- Somewhat of an understanding of SQL syntax & concepts
# Imports import pandas as pd from sqlalchemy import create_engineengine = create_engine('postgres://DSI8:p3c1341f0f241ef04befd0f0b3acc0365eb30839b9408f9a9b3278d96966f34b8@ec2-34-202-213-35.compute-1.amazonaws.com:5432/dc5rooirs71hh0')
Select & From
SQL is used to pull specific data from a database. The two primary clauses that must be present in every query are
SELECT
& FROM
.SELECT
allows you to select a subset of columns from a tableFROM
: Since there are often many tables in a database, it's important to specify which table you're querying.
sql = """
SELECT *
FROM restaurants
"""
df = pd.read_sql_query(sql, engine)
df.head()
Ex 1: Let’s view the
name
& calories
columns from the foods
table.sql = """
SELECT name, calories
FROM foods;
"""
df = pd.read_sql_query(sql, engine)
df.head()
Ex 2: Let’s retrieve the names of all the categories.
sql = """
SELECT name
FROM categories
"""
df = pd.read_sql_query(sql, engine)
df.head()
Name Spacing
Sometimes you see the columns prefixed by their corresponding table. This is overkill when you’re just querying, one table, but becomes important once you query from multiple tables.
Ex 1: Let’s show the
name
&calories
columns from the foods
tablesql = """
SELECT foods.name, foods.calories
FROM foods
"""
df = pd.read_sql_query(sql, engine)
df.head()
Ex 2: Give me the names of all the
categories
using name spacing.sql = """
SELECT categories.name
FROM categories
"""
df = pd.read_sql_query(sql, engine)
df.head()
Ex 3: I want every column from the
foods
table. Using name spacing, how can we apply a wildcard?sql = """
SELECT foods.*
FROM foods
"""
df = pd.read_sql_query(sql, engine)
df.head()
Aliasing
Writing out the same table can get pretty cumbersome. Thankfully we can give our tables a temporary (& hopefully shorter) name. This is called aliasing.
Ex 1: How can we query the
name
& calories
columns from the foods
table, & rename the foods
table to f
?sql = """
SELECT f.*
FROM foods AS f
"""
df = pd.read_sql_query(sql, engine)
df.head()
Ex 2: Let’s retrieve the names of all the
restaurants
, & alias the table?sql = """
SELECT r.name
FROM restaurants AS r
"""
df = pd.read_sql_query(sql, engine)
df.head()
Recall that the
foods
, restaurant
&categories
tables all have a name
column. When we start combining tables into a single query, we might want to give each name column an alias as well.
Ex 3: Let’s show the
name
(temporarily renamed to food) &calories
columns from the foods
table.sql = """
SELECT f.name AS food, f.calories
FROM foods AS f
"""
df = pd.read_sql_query(sql, engine)
df.head()
Ex 4: Give all the category names, rename the
name
column to category
.sql = """
SELECT c.name AS category
FROM categories AS c
"""
df = pd.read_sql_query(sql, engine)
df.head()
Distinct
DISTINCT
returns a list of unique values from a given column.
Ex 1: Show the unique names (renamed to
food
) from the foods
table.sql = """
SELECT DISTINCT f.name AS food
FROM foods AS f
"""
df = pd.read_sql_query(sql, engine)
df.head()
Ex 2: From the
foods
table, query the unique values from the restaurant_id
column.sql = """
SELECT DISTINCT f.restaurant_id
FROM foods AS f
"""
df = pd.read_sql_query(sql, engine)
df.head()
Order By
Sometimes it makes sense to order your query on a certain column. For example, we might want to get a list of users sorted alphabetically. You can order on multiple columns. Priority is given from left to right in your
ORDER BY
clause.
Ex 1: Show the
name
(temporarily renamed to food
)&calories
columns from the foods
table, ordered from most caloric to least.sql = """
SELECT f.name AS foods, f.calories
FROM foods AS f
ORDER BY f.calories DESC
"""
df = pd.read_sql_query(sql, engine)
df.head()
Ex 2: Give all the
categories
in reverse alphabetical order.sql = """
SELECT c.name
FROM categories c
ORDER BY c.name DESC
"""
df = pd.read_sql_query(sql, engine)
df.head()
Ex 3: Retrieve all the columns from
foods
, from most carbs
to leastsql = """
SELECT f.*
FROM foods f
ORDER BY f.carbs DESC
"""
df = pd.read_sql_query(sql, engine)
df.head()
Ex 4: What food has the most fat?
sql = """
SELECT f.*
FROM foods f
ORDER BY f.fat DESC
"""
df = pd.read_sql_query(sql, engine)
df.head()
Ex 5: Give the
restaurant_id
(renamed to rid
), name
&calories
from foods
. Order first by restaurant_id
from smallest to biggest, then by calories
from biggest to smallest.sql = """
SELECT f.restaurant_id AS rid, f.name, f.calories
FROM foods f
ORDER BY f.restaurant_id ASC, f.calories DESC
"""
df = pd.read_sql_query(sql, engine)
df.head()
Ex 6: Give the
calories
&name
from the foods
table. Ordered first by calories
in descending order, then by name
in alphabetical order.sql = """
SELECT f.name, f.calories
FROM foods f
ORDER BY f.calories DESC, f.name ASC
"""
df = pd.read_sql_query(sql, engine)
df.head()
Limit
Rather than returning ALL rows from a given table, you might only want a subset. This can be achieved with the
LIMIT
command.
Ex 1: Give the
name
&calorie
of the 20 most caloric items from the foods
table.sql = """
SELECT f.name, f.calories
FROM foods f
ORDER BY f.calories DESC
LIMIT 20
"""
df = pd.read_sql_query(sql, engine)
df.head()
Ex 2: What are the top ten most fatty foods?
sql = """
SELECT f.*
FROM foods f
ORDER BY f.fat DESC
LIMIT 10
"""
df = pd.read_sql_query(sql, engine)
df.head()
Where
One of the more important skills in SQL is the ability to filter your queries based a certain condition. This is accomplished with the
WHERE
command.
Numerical filters are similar to python:
- “Greater than”:
>
AND “Less than”:<
- “Greater than or equal to”:
>=
AND “Less than or equal to”:<=
- “Equal to”:
=
Ex 1: Give the
name
&calories
of foods
with more than 1,000 calories.sql = """
SELECT f.name, f.calories
FROM foods f
WHERE f.calories > 1000
ORDER BY f.calories
"""
df = pd.read_sql_query(sql, engine)
df.head()
Ex 2: What are the top 10 least caloric foods over 100 calories?
sql = """
SELECT f.name, f.calories
FROM foods f
WHERE f.calories > 100
ORDER BY f.calories ASC
LIMIT 10
"""
df = pd.read_sql_query(sql, engine)
df.head()
And/Or
Multiple
WHERE
clauses can be chained together with AND
&OR
, similar to chaining multiple boolean expressions in Python.
Ex 1: Give the
name
, calories
&carbs
from all foods
over 1,000 calories or having at least 30g of carbs.sql = """
SELECT f.name, f.calories, f.carbs
FROM foods f
WHERE f.calories > 1000
OR f.carbs >= 30
"""
df = pd.read_sql_query(sql, engine)
df.head()
Ex 2: Give all foods from Jimmy Johns (
restaurant_id
= 27) with 700 or more calories.sql = """
SELECT f.*
FROM foods f
WHERE f.restaurant_id = 27
AND f.calories >= 700
"""
df = pd.read_sql_query(sql, engine)
df.head()
Between
You can also filter numerically based on a range with
BETWEEN
.
Ex 1: Give the
name
&calories
of foods
that are between 0 & 10 calories (inclusive).sql = """
SELECT f.name, f.calories
FROM foods f
WHERE f.calories BETWEEN 0 AND 10
"""
df = pd.read_sql_query(sql, engine)
df.head()
Joins
It’s very common to want to combine information from multiple tables into one query. For instance, we might want to find the user’s name for a given blog post. We can do this by utilizing joins. There are several types of joins:
- Inner join:
- Left/Right join:
- Left/Right outer join:
- Unions:
*NOTE*: The most common join is the inner join. Joins are the most common interview question when asked about knowledge of SQL! This is why I wrote this article & decided to focus on this section the most for this scenario.
Ex 1: Give the names of every food item with their associated restaurant. Be sure to alias the columns.
sql = """
SELECT f.name AS food, r.name AS restaurant
FROM foods f
INNER JOIN restaurants r ON r.id = f.restaurant_id
"""
df = pd.read_sql_query(sql, engine)
df.head()
Ex 2: Give the names of every food item from Burger King.
sql = """
SELECT f.*, r.name AS restaurant
FROM foods f
INNER JOIN restaurants r ON r.id = f.restaurant_id
WHERE r.name = 'Burger King'
"""
df = pd.read_sql_query(sql, engine)
df.head()
Ex 3: Give the names of every food item with their associated category.
sql = """
SELECT f.name AS food, c.name AS category
FROM foods f
INNER JOIN categories_foods cf ON cf.food_id = f.id
INNER JOIN categories c ON c.id = cf.category_id
"""
df = pd.read_sql_query(sql, engine)
df.head()
Ex 4: Give the
name
, restaurant
, category
&calories
of every food item. Sorted alphabetically by restaurant
, then by calories
in descending order.sql = """
SELECT f.name AS food, f.calories, c.name AS category, r.name AS restaurant
FROM foods f
INNER JOIN categories_foods cf ON cf.food_id = f.id
INNER JOIN categories c ON c.id = cf.category_id
INNER JOIN restaurants r ON r.id = f.restaurant_id
ORDER BY r.name, f.calories DESC
"""
df = pd.read_sql_query(sql, engine)
df.head()
Ex 5: Give the names of every food item in the Desserts category.
sql = """
SELECT f.name AS food, c.name AS category
FROM foods f
INNER JOIN categories_foods cf ON cf.food_id = f.id
INNER JOIN categories c ON c.id = cf.category_id
WHERE c.name = 'Desserts'
"""
df = pd.read_sql_query(sql, engine)
df.head()
Ex 6: Give the
name
, restaurant
&calories
of the top ten most caloric "Kid's Meals" (this is a category).sql = """
SELECT f.name AS food, f.calories, c.name AS category, r.name AS restaurant
FROM foods f
INNER JOIN categories_foods cf ON cf.food_id = f.id
INNER JOIN categories c ON c.id = cf.category_id
INNER JOIN restaurants r ON r.id = f.restaurant_id
WHERE c.name = 'Kid''s Meals'
ORDER BY f.calories DESC
LIMIT 10
"""
df = pd.read_sql_query(sql, engine)
df.head()
Aggregating
Sometimes we might want to reduce our query to a single value. For example, we may want to know how many users our in our database. The
COUNT
in the previous query is an aggregate function. The most common aggregate functions are:COUNT, AVG, MIN, MAX, SUM
Ex 1: What’s total number of calories in the entire
foods
table?sql = """
SELECT SUM(f.calories)
FROM foods f
"""
df = pd.read_sql_query(sql, engine)
df.head()
Group By
Often we’ll want to group our data into buckets and then run some sort of aggregate function. Every column you’re returning that isn’t being aggregated needs to be in the
GROUP BY
clause.
Ex 1: Return the average number of calories for each restaurant, sorted alphabetically.
sql = """
SELECT r.name, AVG(f.calories)
FROM foods f
INNER JOIN restaurants r ON r.id = f.restaurant_id
GROUP BY r.name
ORDER BY r.name
"""
df = pd.read_sql_query(sql, engine)
df.head()
Having
Sometimes you might want to use the result of an aggregate function as a filter. We can do this with
HAVING
, which is similar to WHERE
but for aggregates.
Ex 1: Give the
name
& average calories
for all restaurants
with more than 700 calories per menu item on average.sql = """
SELECT r.name, AVG(f.calories)
FROM foods f
INNER JOIN restaurants r ON r.id = f.restaurant_id
GROUP BY r.name
HAVING AVG(f.calories) > 700
ORDER BY r.name
"""
df = pd.read_sql_query(sql, engine)
df.head()
Ex 2: What
categories
average more than 750 calories
per item? Order your results from most caloric to least.sql = """
SELECT c.name, AVG(f.calories)
FROM foods f
INNER JOIN categories_foods cf ON cf.food_id = f.id
INNER JOIN categories c ON c.id = cf.category_id
GROUP BY c.name
HAVING AVG(f.calories) > 750
ORDER BY AVG(f.calories) DESC
"""
df = pd.read_sql_query(sql, engine)
df.head()
Wild Cards
We can use
LIKE
and wildcards (%) to broaden our string filters. Note:ILIKE
is a case insensitiveLIKE
.- You can negate a
LIKE
withNOT LIKE
. - In this lecture, all wildcards need to be double parentheses (
%%
).
Ex 1: Find all the Whoppers in the
foods
table.sql = """
SELECT f.*
FROM foods f
WHERE f.name ILIKE '%%Whopper%%'
"""
df = pd.read_sql_query(sql, engine)
df.head()
Null Values
Databases can have null values (our equivalent to
NaN
).
Ex 1: Give me everything from the foods table with null values in the
calories
column.sql = """
SELECT f.*
FROM foods f
WHERE f.calories IS NULL
"""
df = pd.read_sql_query(sql, engine)
df.head()
Thank you for reading this article taking interest in Sequel application within Python to be able to solve data science concepts. This article will be updated regularly as well. What we didn’t cover is how to apply the data science concepts such as cleaning, exploratory data analysis, munging, feature engineering & machine learning modeling. We can apply this instantly after retrieving the data! We use Pandas Python library however to view the data. We use Pandas frequently in data science.
Click here for more-->>
Comments
Post a Comment
Thanks for message to VroyCLe™.