SQL

use between for date functions

get the 3rd highest salary

-- Row Number :
SELECT Salary,EmpName
FROM
  (
   SELECT Salary,EmpName,ROW_NUMBER() OVER(ORDER BY Salary) As RowNum
   FROM EMPLOYEE
   ) As A
WHERE A.RowNum IN (2,3)

-- Sub Query :
SELECT *
FROM Employee Emp1
WHERE (N-1) = (
               SELECT COUNT(DISTINCT(Emp2.Salary))
               FROM Employee Emp2
               WHERE Emp2.Salary > Emp1.Salary
               )

-- Top Keyword :
SELECT TOP 1 salary
FROM (
      SELECT DISTINCT TOP n salary
      FROM employee
      ORDER BY salary DESC
      ) a
ORDER BY salary

update_statistics after a bulk data transform

SELECT * FROM Employees WHERE EmpName like ‘S%’

Joins

  • inner
    • only keep what matches
  • left/right/full outer
    • left: keep all left table
    • right: keep all right table
    • full: keep all rows in both tables
  • cross
    • cross multiplies, number of output rows = num left rows x num right rows

Aggregations

  • count
    • count how many are in the group
  • sum
    • add up values in group
  • average
    • returns average of group
  • HAVING
    • to do a where clause on an aggregation
select age, sum(weight) as total_weight
from cats 
group by age 
having sum(weight) > 12 order by age DESC

CASE

execute parts of a SQL statement conditionally

CASE input_expression /* usually a column_name */
WHEN conditional_expression1 THEN result_expression1
[WHEN conditional_expression2 THEN result_expression2]
[ELSE result_expression]
END

-- Example
SELECT LANGUAGE_NAME, RATING_CODE AS RATING, 
CASE RATING_CODE
WHEN '3' THEN 'DIFFICULT'
WHEN '2' THEN 'FAIRLY DIFFICULT'
WHEN '1' THEN 'EASY'
ELSE 'UNKNOWN'
END AS RATING_DESCRIPTION
FROM PROGRAMMING_LANGUAGES;

Filtering

  • WHERE
    • The WHERE clause is used to extract only those records that fulfill a specified condition.
    • = Equal
    • <> Not equal. Note: In some versions of SQL this operator may be written as !=
    • > Greater than
    • < Less than
    • >= Greater than or equal
    • <= Less than or equal
    • BETWEEN Between an inclusive range
    • LIKE Search for a pattern
      • % is like *
      • _ is like ?
    • IN To specify multiple possible values for a column
    • IS NULL / IS NOT NULL
  • HAVING
    • For aggregate functions

Order

SELECT column1, column2, ...
FROM table_name
ORDER BY column1, column2, ... ASC|DESC;

Windows

http://dcx.sap.com/1200/en/dbreference/window-statement.html

-- Running total
select name, sum(weight) over (order by name) as running_total_weight
from cats
order by name

-- Partitioned running total
select name, breed, sum(weight) over (partition by breed order by name) as running_total_weight
from cats

-- Counting - ROW_NUMBER()
select row_number() over (order by color,name) as unique_number, name, color 
from cats

-- Ordering - RANK()
select rank() over (order by weight desc) as ranking, weight, name 
from cats
order by ranking, name DESC

-- ntile() lets us divide our data into percentiles/quartiles. 
-- If we only need to see the lowest X% this can be done via ntile()
-- Quartiles - NTILES()
select name, weight, ntile(4) over ( order by weight) as weight_quartile
from cats 
order by weight_quartile, name

-- Dense Rank - DENSE_RANK() 
-- assigns each group a rank sequentially
select dense_rank() over (order by age DESC) as r, name,age
from cats
order by r, name

-- Compare previous row
-- LAG() lets us compare to the previous rows [and lead() the next rows] [min() also works] It is useful for looking for strange step ups/downs in data
select name, weight, coalesce(weight - lag(weight, 1) over (order by weight), 0) as weight_to_lose
FROM cats
order by weight

-- Compare previous with partition
select name, breed, weight, coalesce(weight - lag(weight, 1) over (partition by breed order by weight), 0) as weight_to_lose
from cats
order by weight, name

-- FIRST_VALUE()
-- lets us get the first value in a subgroup
select name, color, first_value(weight) over (partition by color order by weight) as lowest_weight_by_color
from cats
order by color, name

-- Window
select name, weight, ntile(2) over ntile_window as by_half, ntile(3) over ntile_window as thirds, ntile(4) over ntile_window as quart
from cats window ntile_window AS ( ORDER BY weight)
order by weight, name

correlated subqueries

set operations

https://en.wikipedia.org/wiki/Set_operations_(SQL)

  • UNION
    • union: get distinct values, concatenates the tables
      • combines two result sets
      • select distinct on the result set
    • union all: get all values
      • faster cuz no distinct
  • INTERSECT
    • returns rows that are in both sets
  • EXCEPT
    • returns opposite of INTERSECT
    • like query result set1 - set2

Denormalize

sometimes the DWH is normalized and the data marts are denormalized

http://www.vertabelo.com/blog/technical-articles/denormalization-when-why-and-how

having redundant data records for the sake of better performance

  • Advantages
    • Maintaining history
    • improve query performance, fewer joins
    • speedup reports and dashboards
    • compute common values upfront
  • Disadvantages
    • more disk space because duplicate data
    • data anomolies, make sure data is changed in both places
      • use triggers, transactions, procedures to make sure everything is changed
    • documentation
    • slower modification
      • slower insert, update, and delete
    • more coding
      • but less coding for select queries

Star Schema

AKA dimensional data model

  • commonly used in data marts
  • central fact table with dim tables all connecting to it
  • each fact is surrounded by its dimensions

In dimensional modeling, it’s very important to define the right dimensions and choose proper granulation

  • Fact tables
    • quantitative measurements
    • aggregated dimensions together that has info
    • data we want in reports
    • stores a bunch of FKs in each row
    • how much and how many
    • can be fine grained or course grained/aggregate over a period of time
  • Dimension
    • classification information
    • who, when, where and what
    • they have columns like state_code and state_name so queries are faster

Snowflake

The snowflake is an extension to a dimension intended to reduce storage and duplication. It has the undesirable side effect of complicating and slowing queries.

Bridge Tables: for many to many relationships

Slowly Changing Dimension (SCD)

  • Type 1: overwrite
  • Type 2: new row with new data
  • Type 3: update row, column holds old_cust_code and current_cust_code

Data Mart - Front room

Data marts are like small pieces of a DWH. Helps with security and less confusing for user.

  • Top-Down: Data marts are created from the data warehouse. (This is something that Bill Inmon, the “father of the data warehouse”, would agree with, along with the idea that warehouses should be in 3NF.)
  • Bottom-Up: Data marts are created first, then combined into a data warehouse. (This approach is closer to what Ralph Kimball, a data warehouse and dimensional modeling expert, advocates.)

Data Modeling

http://www.vertabelo.com/blog/technical-articles/5-steps-for-an-effective-database-model

  1. Domain
    • What type of data
  2. Functionality
    • What will be done with the data
    • CRUD or CRU
    • How long will the data be kept
  3. Entities
    • what are the main entities
    • what types of interactions are there
  4. Relationships
    • FKs or loosly defined related IDs
  5. Formal Design