Skip to main content

SQL Basics

Beginners guide to SQL for a career in Data Analytics

Data Analytics is the science of examining raw data to draw conclusions about that information. Being a Data Analyst myself, it's quite understandable that we can't do anything until we have the data to work with. So, how can we get that data? 

Till date, most of the data being captured for any purpose stays in a tabular format. We need to query the data using “Structured Query Language (SQL)” - the standard language to query a database. 

So let's begin and understand the basics of SQL:

What is SQL?

SQL is a programming language designed for managing data in a relational database. SQL is used for accessing, cleaning, and analyzing data that are stored in databases.

Data analysts use SQL mainly because:
  • It’s easy to understand and learn
  • It is cross-platform i.e. the syntax of SQL can be used on almost all well-known industry-wide platforms such as MySQL, Postgres, Google Bigquery, etc.
  • A large amount of data can be accessed & analyzed directly using SQL without having to store the data in another application
    • It is far more efficient than using Excel when it comes to large data

Database

To start with, let's ask ourselves, what is a database. A database is a collection of tables. Table is a collection of related data held together in tabular format. A table contains:
  • Columns: Column is defined by its name and data type. The name is used in SQL statements, and the data type is used to validate information stored. 
  • Rows: Each row is uniquely identified by a primary key. A specific choice of columns that uniquely identify rows is called the primary key.

       Example of a table:

       Table Name: customerinfo (We will refer to the below table for all SQL queries)

A primary key is a field in a table that uniquely identifies each row/record in a table.


How will you get a list of tables from the database?

We can retrieve tables from a database using the below query:
SELECTFROM DB_Name.Tables; 
#Lists all the tables present in the database "DB_Name"


How will you get a list of tables containing a word?

If you want to look for all the tables that contain the character "customerinfo" use the below query:
SELECT TABLE_NAME 
      FROM INFORMATION_SCHEMA.tables
           WHERE LOWER(TABLE_NAME) LIKE '%customerinfo%'; 

#Lists all the tables present in the database in use which contains "customerinfo" in the table name. By using lower() we are ensuring case-insensitive searching.


Few basic queries to understand the values stored in the table:

Let us see what are the attributes of a table 'customerinfo' using the below queries:
SELECT TOP 5 * FROM customerinfo;  
#Lists first 5 records(rows) from table "customerinfo" along with fields (Column) name

SELECT * FROM customerinfo LIMIT 10;  
#Lists 10 records(rows) from table "customerinfo" along with fields (Column) name

SELECT CustomerName, Email_id FROM customerinfo;  
#Displays records for only the fields customername & Email_id

SELECT COUNT(*) FROM customerinfo;  
#Displays the # of records(rows) in the table


How to get distinct records?

Column often contains many duplicate values, and sometimes you only want to list the different (distinct) values. Then we use the below queries:

SELECT COUNT(DISTINCT CustomerName) FROM customerinfo;  
#Displays count of unique names present in CustomerName column

SELECT DISTINCT CustomerName, Age FROM customerinfo;  
#Displays unique row for column CustomerName & Age


How to only filter certain information from a table using SQL?

Now suppose for the analysis you only need records from a table where certain conditions are met, there we need to apply a filter on that table by using the 'WHERE' clause:

SELECT * FROM customerinfo WHERE CustomerName = 'Aditi'; 
#This query will help you retrieve all records from the table where column 'CustomerName' contains value 'Aditi'


Multiple conditions can be applied using "AND" & "OR" operators. Example:
SELECT * FROM customerinfo WHERE CustomerName = 'Aditi' AND Group_id = 33; 
#This query will help you retrieve records for customer whose name is Aditi and belong to group 33

Pattern Search

Similarly above query can be modified to use LIKE operator to search for a specified pattern in a column:
SELECT * FROM Table_name WHERE column1 LIKE '_d%'; 
#This query will help you retrieve all records from the table where the column1 contains just one character before d and contains 0 or any of characters post that


Aggregate Functions

GROUP BY statement is often used with aggregate functions (COUNT, MAX, MIN, SUM, AVG) to group the result-set by one or more columns. GROUP BY statement groups rows that have the same values into summary rows. For example, if we would like to see # of customers belonging to each group_id:
SELECT group_id, Count(*) as cust_count FROM customerinfo GROUP BY group_id; 
Now suppose we want to apply filter on an aggregated function i.e. cust_count from the above query we use HAVING clause:
SELECT group_id FROM customerinfo GROUP BY group_id HAVING Count(*) > 2;  
#Will display group_ids with more than 2 customers

Note: WHERE can only be applied on a field which already exists either in the table or inner query;


Query across multiple tables (JOINS)

When we have to analyze more than one table we often need to use JOINS. JOIN clause combines rows from two or more tables, based on a related column between them

Types of Joins:

Inner Join

Ensures that you get only the records that have matching values in both tables.
SELECT Orders.Order_id, Customers.CustomerName 
         FROM Orders
              INNER JOIN
                   customerinfo ON Orders.Customer_id = customerinfo.Customer_id;
#This query will return the customer name for each order_id only if there is matching customer_ids in both tables 

Left Outer Join

Returns all records from the left table and only the matched records from the right table.
SELECT Orders.Order_id, Customers.CustomerName 
         FROM Orders
              LEFT OUTER JOIN
                   customerinfo ON Orders.Customer_id = customerinfo.Customer_id;
 
#This query will return all order_ids from the orders table and in the case where it finds matching customer_id in customer table then it will give the customerinfo name along with the order_id else customer name field will be null 

Right Outer Join

Returns all records from the Right table and only the matched records from the left table.
SELECT Orders.Order_id, Customers.CustomerName 
         FROM Orders
              RIGHT OUTER JOIN
                   customerinfo ON Orders.Customer_id = customerinfo.Customer_id;

#This query will return all customer names from customerinfo table and in the the case where it finds matching customer_id in orders table then it will give the order_id else that field will be null 

Full Outer Join

Returns all records when there is a match either in the left or right table.
SELECT Orders.Order_id, Customers.CustomerName 
         FROM Orders
              FULL OUTER JOIN
                   customerinfo ON Orders.Customer_id = customerinfo.Customer_id;

#This query will return non-null customer name and order_id where it finds matching customer_id in both tables and for non-matching cases will return records with customer name from one table having order_id field as null and order_id from other table with customer name field containing a null value


That is it from today. If you want me to write a blog on a specific topic, please subscribe to my blog and email me. I will be more than happy to share my knowledge.

In the upcoming post, I will cover more SQL use cases and show you how to use clauses (row_number, rank, ntile, etc), optimization technique & order of execution.

Do let me know your Feedback & Suggestions for this post!

Comments

Post a Comment