Crib Sheet CC-BY-NC

Maintainer: admin

1ER Model

Entity
real-world object distinguishable from other objects. An entity is described using a set of attributes
Entity set
a collection of similar entities
eg: all employees
each entity set has a key, a minimum set of attributes whose values uniquely identify an entity in the set
each attribute has a domain

Diagram 1

1.1ISA Hierarchies: Subclasses

A ISA B, then every A is also a B entity. Key is only in B

Diagram 2

E/R Viewpoint: An entity has a component in each entity set to which it logically belongs. Its properties are the union of these entity sets

1.1.1Constraints

Overlap Constraint
can an entity be in more than one subclass?
Covering Constraint
must every entity of the superclass be in one of the subclasses?

1.2Relationship

Relationship
association among two or more entities
eg: Employee A works in department X
Relationship set
Collection of similar relationships
An n-ary relationship set R relates n entities set E1...En. Each relationship in R involves entites $e1 \in E1, \ldots , en \in En$

Diagram 3

1.2.1Many-to-many

Each entity can have relationships with multiple entities

note: a relationship is uniquely defined by the primary keys of participating entities. That is Employee A cannot work twice in the department X

Diagram 4

1.2.2One-to-many

eg: manages. One employee manages several departments. Each department is managed by only one employee

Key Constraint
If an entity in an entity set can only have one other entity involved in the relationship, this is a key constraint
Depicted with an arrow from the one to the one-to-many relationship set

Diagram 5

1.2.3One-to-one

Represented with key constraints in both directions

Diagram 6

1.2.4Participation Constraints

Participation Constraint
Requires that every entity in the entity set participating in the relationship MUST have a relation
represented with a thick line from the constrained entity set to the relationship

Diagram 7

Can also combine key constraints with participation constraints

Diagram 8

1.2.5Weak Entities

Weak entity
can be indentified uniquely ONLY by considering the primary key of another entity
The key in this kind of set is the union of the keys of the owner entity set and the set of its own attribtues.
represented with a bold rectangle

Diagram 9

1.2.6Ternary Relationship

Diagram 10

1.2.7Aggregation

Aggregation
Allows us to treat a reltionship set R as an entity set so that R can participate in other relationships

Diagram 11

2Relational Model

ER is a semantic model. Relational model is a DATA model

Relation
Consists of a schema and an instance
Schema
specifies name of relation, plus a set of attributes, plus the domain/type of each attribute
Instance
set of all tuples
tuples are distinct
Database Schema
Collection of relation schemas

A relation can be seen as a table. Column headers = attribute names, rows = tuples/records, columns/fields = attribute values.

The number of rows = cardinality. Number of fields = degree / arity

All rows are distinct. Rows are not ordered

DDL
Data defintion language
defines the schema of a databse
DML
Data manipulation language
manipulates the data (instances of the relations)
eg: Insert, update, delete
Query the relationship
aka Query Language

2.1SQL

Structured Query Language (SQL)

Used to define relations and to write/query data

2.1.1Data Types

Name Description
CHAR(n) A character string of fixed length n
VARCHAR(n) denotes a string of up to n charaters
INT or INTEGER An integer
SHORTINT Shorter version of an integer
FLOAT or REAL floating point number
DOUBLE PRECISION eg: Double in java
DECIMAL(n,d) real number with fixed decimal pooint. Value of consits of n digits with a decimal point d positions from the right
DATE In the form YYYY-MM-DD
TIME Form is 15:00:02 or 15:00:02.5
Bit strings ???

2.1.2Table Creation

CREATE TABLE Students
(sid INT,
 name VARCHAR(20),
 login CHAR(10),
 major VARCHAR(20)
     DEFAULT 'undefined')

Defines all attributes of the relation and the type/domain of each.

SQL is case insensitive

2.1.3Destorying/Altering Relations

DROP TABLE Students

Destroys the schema AND the tuples are deleted

ALTER TABLE Students
    ADD COLUMN firstYear:integer

Adding a new row results in all existing tuples gettings a new field with the null value

2.1.4Insert/Delete/Update Tuples

Inserting a single tuple

INSERT INTO Students (sid, name, faculty)
    VALUES (53688, 'Chang', 'Eng')

Deleting Multiple Tuples

DELETE FROM Students WHERE name = 'Chang'

Updating Tuples

UPDATE Students SET faculty = 'Science' WHERE sid = 53688

2.1.5Integrity Constraints

An Integrity Constraint Must be true for any instance of the database. There are domain specific constraints that are already enforced (eg on INT)

DB System will enforce any constraint and reject a modification that violates it

2.1.5.1NOT NULL
CREATE TABLE Students
(sid INT NOT NULL,
 name VARCHAR(20),
 login CHAR(10),
 gpa REAL
     DEFAULT 0.0)

requires an attribute to always have a proper value

2.1.5.2Primary Key Constraints

A set of fields is a key for a relation if:

  1. No two distinct tuples can have the same values of all key fields
  2. This is no true for any subset of the key

Any key that satisfies these properties is considered a candidate key. One is chosen to be the primary key

Primary key attributes may not be NULL

CREATE TABLE Students
(sid INT NOT NULL PRIMARY KEY,
 name VARCHAR2(20))

CREATE TABLE Enrolled
(sid INT NOT NULL,
 cid CHAR(20) NOT NULL,
 grade CHAR(2) NOT NULL,
 PRIMARY KEY (sid,cid),
 UNIQUE (cid,grade))

sid makes a good primary key in the first one. In Enrolled we want each student to have ONE grade per class!

2.1.5.3Foreign Key
Foreign Key
set of fields in one relation that is used ot "refer" to a tuple in another relation
must correspond to the primary key of the second relation

represents a "logical pointer"

CREATE TABLE Enrolled
(sid INT NOT NULL,
PRIMARY KEY(sid)
FOREIGN KEY (sid)
REFERENCES Students)

Foreign Key Constraint

foreign key value of a tuple must represent an existing tuple in the referred relation.

Tuples that refer to nonexistent foreign keys are not allowed to be inserted.

What happens when a refered tuple is deleted?
Delete all tuples that reference it
Disallow deletion as long as something refers to it
set the foreign key to a feault value
When the primary key is updated
updated all tuples that reference it
or do something similar to deletion (disallow, delete referencing tuples)

SQL default to disallowing. Can use CASCDE

eg:

CREATE TABLE Enrolled
 (sid INT NOT NULL,
  cid CHAR(8) NOT NULL,
  grade CHAR(2),
  PRIMARY KEY (sid,cid)
  FOREIGN KEY (sid)
  REFERENCES Students
  ON DELETE CASCADE
  ON UPDATE CASCADE)

Integrity Constraints are based on the semantics of the realworld. You can check that an IC holds by looking at the whole table, but you can't assume from examing one or some tuples

2.2Translating ER to R

2.2.1Entity Sets to Relations

Diagram 11

One relation per entity set

2.2.2Many-to-Many relationship sets

Always translated as an individual table. Attributes for the table are the keys for each participating set (as foreign keys) and all descriptive attributes

Diagram 12

2.2.3Key Contraints

If one-to-many make the one the primary key, hence guaranteeing uniqueness. Entity set and relationship each gets a table

OR

Include relationship set in table of the entity set with the key constraint as a field.

2.2.4Key Contraints and participation contraints

Include relationship set in table of the entity set with the constraints

2.2.5Partitipation Conatraints

Usually can't be done UNLESS also key constraint

2.2.6Renaming

Might happen if a relation involves to entities from the same set.

Diagram 13

2.2.7Weak Entity Sets

Weak identity set and identifying relationship set are translated into a single table. Must cascade on deletion of primary foreign key

Diagram 14

2.2.8ISA Hierarchies

2.2.8.1General approach

distribute information among relations. Super class defines general attributes and key. Subclasses are relations with additional attributes and superclass key. subclasses must be altered when superclass altered.

Diagram 15

2.2.8.2object-oriented approach

sub-classes have all the attributes

Pro Con
Query asking about just subclass entities only has one table to search Query on attributes of all superclass entities have to traverse all tables of subclasses
Entities that are in two subclasses are stored twice
2.2.8.3One big relation approach

One big table with all the attribtues of all subclasses (can be null)

2.2.9Aggregation

Diagram 16

Option 1: No key contraints
Basically just create a table for the aggregation relation that references the keys of all involved
  • Projects(pid,started_on,pbudget)
  • Departments(did,dname,budget)
  • Employees(eid,name,salary)
  • Sponsors(pid,did,since)
  • Monitors(pid,did,eid,until)
Option 2: Use inner relationship
Key constraint on relationship within the aggregation and the other entity set
eg: Sponsor(pid,did, eid, since, until)
Option 3: Use Outter Relationship
No Sponsors
Project(pid, started_on, pbudget, did, since)
Monitors(pid,eid,until)

3Relational Algebra

Basics

Input
input one or two relations
schema of each relation is known
instance can be arbritrary
Output
a relation
schama of output relation depends on operator and input relations

Relational algebra is closed

Name Symbol Description
Selection $\sigma_{condition}(R_{in})$ Selects a subset of tuples from a relation
Projection $\Pi_{attribute}(R_{in})$ Projects to a subset of attributes from a relation
Renaming $\rho(R_{out}(B1, \ldots, Bn), R_{in}(A1, \ldots, AN))$ Rename relations or attribtues
Cross product $\times$ Combines two relations. Each row of first input combined with each row of second input
Join $\bowtie$ Combination of cross product and selection
Division \ TODO
Intersection $\cap$
Union $\cup$
Set Different - Tuples that are in the first, but not the second

Operators can be composed

Set operators must be on relations with the same number of attribtues and same types

3.1Types of Joins

  1. Condition/Theta Join $R_{out} = R_{in1} \bowtie_c R_{in2} = \sigma_c(R_{in1} \times R_{in2})$
  2. Equi Join: $R_{out} = R_{in1} \bowtie_{Rin1a1 = Rin2b1, \ldots, Rin1an = Rin2bn} R_{in2}$ Condition join where condition contains ONLY equalities
  3. Natural Join: Equijoin on all common attributes

3.1.1Division

Let A have have 2 fields, x and y, have only field y.

$A/B = \{ x | \forall y \in B \exists x,y \in A\}$

A/B containsall x typles such athf ro every y tuplein B there is an xy tuple in A

3.1.2Rules of Relational Algebra

Let $R,S,T$ be relations. $C, C1, C2$ are conditions. $L, L1, L2$ are projection lists on R and S.

Communitivity
order of operands does not matter (except renaming, division and difference)
$\Pi_L (\sigma_C (R)) = \sigma_C (\Pi_L (R))$
$R1 \bowtie R2 = R2 \bowtie R1$
Associativity
You can move brackets around
$R1 \bowtie (R2 \bowtie R3) = (R1 \bowtie R2) \bowtie R3$
Idempotence
$\Pi_{L2} (\Pi_{L1}(R)) = \Pi_{L2} (R)$ if $L2 \subset L1$
$\sigma_{C2} (\sigma_{c1}(R)) = \sigma_{C1 \wedge C2}(R)$

4Functional Dependicies

4.1Types of Problems

Suppose we have Hourly_Emps(sin, rating, hourly_wages, hourly_worked)

Diagram 16

Wages is related to rating

  1. Update Anomly: Can we change the wages in only the first tuples?
  2. Insertion Anamoly: What if we want to insert an employee and don't know the hourly wage for his/her rating?
  3. Insertion Anamoly II: We can only introduce a new rating scheme when we enter an emplyee with this rating into the relation
  4. Deletion Anamoly: If we delete all employees with rating 5, we lose the information about the wage for rating 5!
  5. Storage Redundancy

4.2Functional Dependencies

A functional dependency $X \to Y$ ($X$ and $Y$ are sets of attributes; $X,Y \subset$schema($R$)) holds over relation $R$ if for every allowable instance $r$ of $R$:
* Given tuples $t1$ and $t2$ in instance $r$ of $R$, if the X values of $r1$ and $t2$ agree, then the $Y$ values must also agree
* $t1 \in r, t2 \in r, \Pi_x(t1) = \Pi_x(t2) \to \Pi_y(t1) = \Pi_y(t2)$

FDs are determined by the semantics of the application.

Key Candidates
Let schema$(R) = (a1, a2, ... an)$; a set of attribtues $X \subset$ schema$(R)$ is a candidate key for a relation $R$ if $X \to R (= X \to a1,a2, \ldots, an)$
That is some set of attributes of R (could be a single element set) implies all of the attributes of the relation

In the first example, we would move rating -> wage into it's own table and then just store rating!

4.3FD Axioms

($X,Y,Z$ are sets of attributes)

pname Description
Reflexivity If $Y \subset X$ then $X \to Y$
Augmentation $X \to Y$ then $\forall Z, XZ \to YZ$
Transitivity If $X \to Y, Y \to Z$ then $X \to Z$
Union If $X \to Y, X \to Z$ then $X \to YZ$
Decomposition If $X \to YZ$, then $X \to Y, X \to Z$
Armstrong's Axioms
Reflexivity
Augmentation
Transitivity
$F^+$
closure of F
set of all DFs that are implied by F through Armstrong's Axioms
Algorithm for finding the attribute closure of $X \to X^+$ wrt F:

Input: set of all attributes $A$ such that $X \to A$ is in $F^+$

Basis: $X^+$
Induction: If $Z\subset X^+$ and $Z \to A$ is in $F$, add $A$ to $X^$
End when $X^+$ cannot be changed

Finding all key candidates

X is akey candidate if X+ = R AND there is no subset Y of X such that Y+ = R

X has to imply everything and it has do do so minimally

4.4Normalization

If only key-induced FDs hold there is no redundancy in the database

4.4.1BCNF

Boyce-Codd Normal Form

A relation R with FDs F is BCNF if for all $X \to A$ in F+:
* $A \in X$ (trivial FDs)
* $X$ contains a key for $R$

R is in BNCF if the only non-trividal FDs that hold over R are key constraints

4.4.23NF

A relation R with FDs F is in 3NF (3rd Normal Form) if for all $X\ \to A$ in F+

  • $A \in X$ (trivial FDs)
  • X Contains a key for R (X is a superkey) or
  • A is a part of some key for R

4.5Decomposition

If a relation is not in the desired normal form we decompose it.

Given a relation R with attributes A1, ..., An, a decomposition of r consists of replacing R by two or more relations such that:
each new scheme contains a subset of attributes of R and nothing more
every attribute of R appears as an attribute on one of the new relations
* The ruples of R are split such that corresponding subtuples appear in the enw relations

Disadvantages: some queries now have to look at more tables and are thus more expensive

Loseless Join decompositions
The decomposition of R into R1 and R2 is lossless-join wrt F if and only if the closure of F contains: Let {a1, ..., an} be the set of attribute sin R1 and R2: then {a1, ..., an} $\to$ R1 or {a1,..., an} $\to$ R2
In particular if $U \to V$ (non trivial) holds over R then the decomposotion of R into UV and R - V is lossless-join (rejoin does not create new tuples because U is key in UV and foreign key in R-V)
Denoted as a decomposition along FD $U \to V$

4.5.1Decomposing into BCNF

Given a relation R with FDs F and $X \to Y \in FD$ violating BCNF:

  • Compute X+
  • Decompose R into X+ and $(R - X+) \cup X$
  • Repeat until all relations in BCNF

If $X^+ = \{X,Y\}$ we denote this as a decomposition along $X \to Y$

Dependency Preserving Decomposition
Given a relation R with FDs F. A depdendency preserving decomposition allows us to check each dependency in F by looking at one of the decomposed relations
Not all BCNF decompositions are dependency preserving

4.5.2Decomposing into 3NF

Any relation R can be decomposed into relations that are 3NF and dependency preserving

5SQL

SELECT desired attribtues
FROM list of relations
WHERE qualification (optional)

Conversion to relational algebra

SQL Algebra
SELECT projection
WHERE selection and join

Sample: Diagram 17

SQL statement:

SELECT rating, age
FROM Skaters
WHERE rating >= 10 OR age > 15

Relational Agebra:

$\Pi_{rating, age}(\sigma_{rating >= 10 \lor age > 15}(Skaters))$

The only difference is that SQL does not eliminate duplicates

5.1WHERE clause

Comparision
attribute1 op constant
atttribute1 op attribute2

op is one of <, =, >, <>, <=, >=, LIKE

Can use arithematic operations on numbers

Qualification/Condition
name = 'Cheng' AND NOT age = 18
name LIKE '%e_g' %: any string, _: any character
there are other string operators like string-length, concatentation

SQL uses multiset semantic, which allows duplicates, except on primary keys

DISTINCT``` is an optional keyword that will eliminate duplicates

eg:
```sql
SELECT DISTINCT sname FROM Skaters
Renaming Columns
SELECT sid, sname AS skater FROM Skaters WHERE raing < 9
Expressions
SELECT sname, rating+1 AS updgrade FROM Skaters
Constants
SELECT rating AS reality, '10' AS dream FROM Skaters
Will add colum dream with value 10 to output
Ordered Output
SELECT * FROM Skaters ORDER BY age, rating
will wort by age THEN rating

5.2Multirelational Queries

You can have MULTIPLE relations in the FROM clause. Relation.Attribute is used to disambiguate

Cross Product
SELECT * FROM Skaters, Participates
Join
SELECT sname FROM Skaters, Participates WHERE Skaters.sid = Participates.sid
OR SELECT sname FROM Skaters JOIN Participates ON Skaters.sid = Participates.sid

5.2.1Union, Intersection, Difference

Input relations for set operators must be set-comaptible. Result relation is a set (not a multiset)

5.2.1.1Union

Find skaters that have participated in a regional or a local competition

SELECT P.sid
FROM Participates P, Competition C
WHERE P.cid = C.cid AND C.type = 'local'
UNION
Select P.sid
FROM Participates P, Competition C
WHERE P.cid = C.cid AND C.type = 'regional'

Multiset version: UNION ALL

5.2.1.2Intersection

Find skaters that have participated in a regional AND a local competition

SELECT P.sid
FROM Participates P, Competition C
WHERE P.cid = C.cid AND C.type = 'local'
INTERSECT
Select P.sid
FROM Participates P, Competition C
WHERE P.cid = C.cid AND C.type = 'regional'

Multiset Version: Intersect ALL

5.2.1.3Diference

Find skaters that have participated in a local but not in a regional

SELECT P.sid
FROM Participates P, Competition C
WHERE P.cid = C.cid AND C.type = 'local'
EXCEPT
Select P.sid
FROM Participates P, Competition C
WHERE P.cid = C.cid AND C.type = 'regional'

Multiset version: EXCEPT ALL

5.2.2Nested Queries (IN)

A where clause and contain an SQL query which is called a subquery

Find names of skaters who ahve participated in competition #103

SELECT S.sname
FROM Skaters S
WHERE S.sid IN (SELECT P.sid
                FROM Participates P
                WHERE P.cid = 103)

Can invert selection using NOT IN

Basical it is WHERE (a1, a2) in (SELECT a3, a4...)

5.2.3Exists

EXISTS is TRUE iff the relation is non-empty

Find names of skaters who have participated in competion 103

SELECT S.sname
FROM Skaters S
WHERE EXISTS (SELECT *
              FROM Participates P
              WHERE P.cid = 103 AND
                    P.sid = S.sid)

5.2.4Quantifiers

Can use ANY and ALL (like universal and existential quantifiers)

Syntax
WHERE attr op ANY (Select ...)
WHERE attr op ALL (SELECT ...)

Find highested rated skater

SELECT *
FROM SKATERS
WHERE rating >= ALL (SELECT rating FROM Skaters)

5.2.5Division

Find skaters who have reserved all boats

SELECT sname
FROM Skaters S
WHERE NOT EXISTS ((SELECT C.cid
    FROM Competition C)
    EXCEPT
    (SELECT P.cid
    FROM Participates P
    WHERE P.sid=S.sid))
SELECT sname
FROM Skaters S
WHERE NOT EXISTS (SELECT C.cid
    FROM Competition C
    WHERE NOT EXISTS (SELECT P.cid
        FROM Participates P
        WHERE P.cid = C.cid AND
            P.sid = S.sid))

6Advanced SQL

6.1Aggregation

Counting
SELECT count(*) FROM Skaters
Averages
SELECT avg(age) FROM Skaters WHERE rating = 10
Distinct
SELECT count(DISTINCT sid) FROM Participates

Also, SUM, MAX, MIN

6.1.1Grouping

SELECT target-list
FROM relation list
WHERE qualification
GROUP BY grouping listn

A group is a set of tuples that ahve the same value for all attributes in the grouping list. One answer tuples is generated per group. Target-liust contains attributes and/or aggregation terms

eg:

SELECT rating, avg(age)
FROM Skaters
GROUP BY rating

If any aggregation is used, then each element in the attribute list of the SELECT clause must either be aggregated or appear in a group-by clause

6.1.1.1HAVING clause

These are just WHERE claues on GROUP BY

eg: Find average age of skaters with at least two skaters at each rating level

SELECT rating, avg(age)
FROM Skaters
GROUP BY rating
HAVING COUNT(*) >= 2

Aggregate operations cannot be nested!

6.2Views

A view is an umaterialized relation: we store a definition rather than tuples, and get the tuples from existing tables

CREATE VIEW ActiveSkaters (sid, sname)
    AS SELECT DISTINCT S.sid, S.sname
    FROM Skaters S, Participates P
    WHERE S.sid = P.sid

6.3NULL Values

Results in issues for comparisons and arithemtic. Need 3-valued logic (true, false, unknown)

  • NOT unknown = unknown
  • A or B = true if either A = true of B = true
  • A or B = false if A = false and B = false
  • A or B = unknow if A = false and B = unknown, or vice versa, or A = B = unknown
  • A and B = true if A = true and B = true
  • A and B = false if either A or B = false
  • A and B = unknown if A = true and B = unknown, or vice versa, or A = B = unknown

In SQL the WHERE clause eliminates rows that are not true (ie: false or unknown)

6.4Out Join

Same as an inner join, but with dangling tuples paadded with NULL

A regular join on Skaters and Participates only includes result tuples that are in both Skaters and Participates.

An (left) outer join on Skaters (left) and Participates includes Skater tuples that weren't in Participates and pads the missing information with nulls

SELECT S.sid, Sname, P.cid
FROM Skaters S LEFT OUTER JOIN Participates P
ON S.sid =  P.sid

6.5Levels of Abstraction

A single conceptual/Logical Schema defines the logical strucre. Conceptual db design

A Physical Schema describes the files and indexes used

Views describe how users see the data

Physical Data Independence: The conceptual schema protects from changes in the physical structure of data

Logical data independence: External schema proctects from changes in conceptual schema of data

6.6DB Modification

6.6.1Inserts

INSERT INTO SKATERS VALUES (68, 'Jacky, 10, 10)

OR

INSERT INTO Skaters (sid, name) VALUES (68, 'Jacky')

Inserting the result of a query

INSERT INTO ActiveSkaters (
    SELECT Skaters.sid Skaters.name
    FROM Skaters, Participates
    WHERE Skaters.sid = Participates.sid)

6.6.2Deletion

DELETE FROM Competitions WHERE cid = 103

or

DELETE FROM Competition

Will delete EVERYTHING

6.6.3Updates

UPDATE Skaters
SET ranking = 10, age = age + 1
WHERE name = 'debby' OR name = 'lilly'

7SQL Integrity

7.1Constraints

  • Domain Constraints (data type, UNIQUE, NOT NULL)
  • Primary Key Constraints
  • Foreign Key Constraints
  • Attributes and tuple based checks
  • SQL2 Assertions: global constraints (just check that conditions aren't violated, rejection only behaviour)
  • SQL3 Triggers: upon user modification, checks are performed and reactions specified

7.1.1Attribute Based Checks

CREATE TABLE Skaters (
sid INTEGER PRIMARY KEY NOT NULL,
sname VARCHAR(20),
rating INTEGER CHECK(rating > 0 AND rating < 11),
age INTEGER)

Condition is check when the attribute is changed. Condition violations result in rejections. Can be anything that would follow a WHERE clause including subqueries

eg: sname VARCHAR CHECK (sname NOT IN (SELECT name FROM forbidden))

DB2 does not allow subqueries or refences to other attributes...

7.1.2Tuples Based Checks

CREATE TABLE Skaters (
sid INTEGER PRIMARY KEY NOT NULL,
sname VARCHAR(20),
rating INTEGER,
age INTEGER
CHECK (rating <= OR age > 5))

Checked on update and insert

7.1.2.1Naming Constraints

May need to add or delete constraints laters. So we can NAME THEM

CREATE TABLE Skaters (
sid INTEGER NOT NULL,
sname VARCHAR(20),
rating INTEGER CONSTRAINT rat CHECK(rating > 0 AND rating < 11),
age INTEGER
CONSTRAINT pk PRIMARY KEY (sid),
CONSTRAINT ratage CHECK
    (rating <= 4 OR age >5))

Allows us to do ALTER TABLE Skaters DROP CONSTRAINT ratage or ALTER TABLE Skaters ADD CONSTRAINT rateage CHECK (rating <= 5 OR age > 5)

7.1.3Assertions

predicate expressing a condition that we wish the database to always satisfy and what to do when it fails

CREAT ASSETION smallClub CHECK
    ((SELECT COUNT(*) FROM SKATERS) + (SELECT COUNT(*) FROM Competitions) < 100)

Check whenever any relations involved are modified

7.1.4When and How much?

Attribute based checks happen on inserts and upates of that just one tuples. Tuple based checks same thing (cheap)

Assertion, insert or update or delte of tuples involved in any relation (expensive)

7.2Triggers

Trigger
Prodcued that starts automatically if specified changes occur to the DBMS

Made up of three parts

  1. Event (activates the trigger): usually insert/update/delete
  2. Condition (test whether the trigger should run)
  3. Actions (what happens if the trigger runs)

7.2.1Statement Level Trigger

CREATE TRIGGER updateSkater
    AFTER DELETE ON Skaters
REFERENCING NEW_TABLE AS DeletedSkaters
FOR EACH STATEMENT
    INSERT
        INTO StatisticsTable(ModTable, ModType, Count)
        SELECT 'Skaters', 'delete', COUNT(*)
        FROM DeletedSkaters

7.2.2Row Level Trigger

CREATE TRIGGER ratingIncrease
    AFTER UPDATE OF rating on Skaters
    REFERENCING
        OLD AS o
        NEW AS n
FOR EACH ROW
    WHEN (n.rating > 1 + o.rating)
        UPDATE Skaters
        SET skating = 1 + o.rating
        WHERE sid = n.sid

7.2.3Trigger Components

When?

  • AFTER UPDATE OF
  • BEFORE UPDATE OF
  • INSTEAD OF UPDATE

(could be UPDATE, INSERT, DELETE)

Scopes?

  • Row level trigger: references OLD/ NEW
  • Statement level: OLD_TABLE / NEW_TABLE

Indicated in the RERFENCING command

Can execute SEVERAL actions

FOR EACH ROW/STATEMENT
WHEN ...
BEGIN ATOMIC
    do thing1
    do thing2
END

Can also set SQL States. eg: SIGNAL SQLSTATE '75000' SET MESSAGE_TEXT='Rating increase must be by one'

8XML

XML is built with elements

Elements components
Name
Text (eg: foundations)
Child elements (nesting)
Attribtues (included in starttag)

Attribtues are made of names and values eg: <p id="first one">

8.1Tree model

Eache element is a node. Nested elements build a path. The leaves are attributes, text or just elements with neither. Element siblings are ordered

8.2Graph Model

Each element can have an ID attribute allowing elements to references other elements. This produces a graph

8.3DTD

A valid xml documents conforms to a Document Type Description (DTD) although they are optional. Has a BNF-life grammar which established constraints on element structure, attributes and content

Sample:

<!DOCTYPE people[
    <!ELEMENT people(person*)>
    <!ELEMENT person(name*, (lastname|familyname)?)>
    <!ATTLIST person PID ID #REQUIRED
        age CDATA #IMPLIED
        children IDREFS #IMPLIED
        mother IDREF #IMPLIED
    >
    <!ELEMENT name(#PCDATA)>
    <!ELEMENT lastname(#PCDATA)>
    <!ELEMENT familyname (#PCDATA)>
]>

Data types: PCDATA (parsed character data) or CDATA (unparsed)

Attributes
ID unique identifier (similar to primary key)
IDREF: reference to single ID
IDREFS: space-seperated list of references
Values
can give a default value
#REQUIRED must exist
#IMPLIED optional

Specified in an XML file with <!DOCTYPE name SYSTEM "path/to/thing.dtd">

Can use regex style things too. * is 0 or more. + is 1 or more, (a | b)? is one or the other

8.4XPath

Specificy a path expresion to match XML data by navigating down (or accross) the XML tree. returns a node-set

Examples
/bibliography/book/author all author elements by root navigating through those elements
/bibliography/book/@ISBN All ISBN attributes
//title all title elements anywhere in the document
/bibliography/*/title titles of bibliography entries assuming that there could be books, journals, reports, etc...
/bibliography/book[@year>1995] returns books where the year > 1995
/bibliography/book[author='FooBar']/@Year returns the years of books written by FooBar
/bibliography/book[count(author) <2]
/bibliography/book/author[position()=1]/name position is the location of the node in the node set

8.5XQuery

FLOWR = For/let-where-orderby/return

Diagram 18

For vs Let

For binds node variables for iteration

Let binds collection variables for one value

Diagram 19

VS

Diagram 20

Aggregation

let $a = avg(document("bib.xml")/bib/book/price)
for $b in document("bib.xml")/bib/book
where $b/price > $a
return $b

Can use joins with where $b1/author = $b2/author

subqueries, sorting, if-then-else, existential and universally quantified

eg: ```where some $b in $a/book satisfies count($b/author) > 10$

8.6XML in DB2

Storage as extra attribute CREATE TABLE MyXml (id INT NOT NULL PRIMARY KEY, Info XML)

Insert with:

INSERT INTO MyXML(id, INFO) VALUES (1000,
    '<customerinfo cid="1000">
    <name>Kathy Jones</name>
    <addr country =Canada">
        <street>123 fake</street>
        <city>Ottawa</city>
        <prov-state>Ontario</prov-state>
        <pcode-zip>H0H 0H0</pcode-zip>
    </addr>
    </customerinfo>')

8.6.1Querying with XQuery

db2-fin:xmlcolumn

XQUERY
    for $d in db2-fn:xmlcolum('MYXML.INFO')/customerinfo
    where $d/addr[city="Toronto"]
    return $d/name

SELECT and Xquery

SELECT XMLQUERY (
'for $d in $doc/customerinfo
    return $d/name'
    passing INFO as "doc")
FROM MyXml m
WHERE XMLEXISTS(
    '$i/customerinfo/addr[city="Toronto"]' passing m.info as "i")

9Transactions

Imagine a money transfer Transfer(id1, id2, value)

SQL

SELECT balance into :balance
FROM accounts
WHERE accoundid = :id1

balance += value

UPDATE accounts
SET balance = :balance
WHERE accountid = :id1

SELECT balance into :balance
FROM accounts
WHERE accountid = :id2

balance -= value

UDPATE accounts
SET balance = :balance
WHERE accountid = :id2
  1. read(A)
  2. A := A + 50
  3. write(A)
  4. read(B)
  5. B := B - 50
  6. write(B)

A transaction is the DBMS'sabstract view of a user program. It is a sequence of read operations, r(x) and write operations w(X) on objects x (tuple, relation) of the db

Read: bring obkect into main memory from disk, send variable to application

Write: bring object into main memory from disk and modify it. Write it back to disk (might be done later)

9.1ACID

ACID
Atomicity: A transaction is atomic if it executed in it entirety or not at all
Consistency: A transaction must preserve the consistency of the data
Isolation: In case that transaction is executed concurrently, the result must be the same as if they were executed serially
Durability: The changes made by a transaction must be permanent
Consistency
Each transaction must leave the database if the DB is consistent before the tranasction started. Hard to enforce at the db level unless using integrity constraints
Isolation
DBMS can execute transactions concurrently. Enforced by a concurrency control protocol.
Atomicity
A transaction either commits or aborts
In the case of an abort then you have to undo all the operations done
Local Recovery ensures that partial results can be done. Writes keep a pre-image
Durability
There must be a guarantee that changees introduced by a transaction will surive failures
Diagram 19

In the event of a crash we have to do Global Recovery. Involves ensuring commited transactions are in the db, that aborted changes are undone and that tranasctions at the time of the crash are undone.

9.2Buffers and the Disk

When are changes written from main memory to the disk?

Diagram 20

9.2.1FORCE flush

Force all changes of transaction are flushed to the disk BEFORE commit. Eg writes are done straight to disk

Easy recovery for commited or aborted tranasctions, but need cleverness to deal with inprogress transactions

9.2.2NOSTEAL flush

Changes are never flushed before a commit. Once committed, changes can be flushed at leasure. Inprogress and aborted transactionsa are easyt to deal with, nothing on disk. Need to deal with commited but potentially not written transactions.

Ideally we would flush at commit time, but this isn't possible due to I/O not being atomic.

10Indexing

Cost model for data access. We just oncsider number of disk reads.

Types of opertions
scan over all records SELECT * FROM Skaters
Equality search SELECT * FROM Skaters WHERE sid = 100
Range Search SELECT * FROM Skaters WEHRE age > 5 and age <= 10
Insert INSERT INTO Skaters VALUES (23, 'lilly, 10, 8)
Delete could be equality or range
udpate is just delete + insert

10.1Heap Files

Linked, unordered list of all pages of the file

  • scan retriving all records = good because all pages retrieved anyway
  • equality search on primary key = bad have to read ~1/2 of pages
  • range search = bad
  • insert = good, low cost
  • delete/update same as search

10.2Sorted Files

Records ordered according to one or more attributes of the relation

  • scan retriving all records = good
  • equality search on sort attribute = good about log2(number of pages)
  • range search = good find first one, then scan along
  • insert = bad, have to find proper page, need to reorganize
  • delete/update same as search. update may be expensive
  • sorted output: good if on sorted attribute

10.3Indexes

Sorted files can only be ordered by one attribute. Solution is to build an index for any attribute that is frequently queried. Created on a search key (different from primary key)

CREATE INDEX name ON Skaters(sid)
DROP INDEX name

B+tree is the most widely used index. Each ndoe represents on page (the transfer unit to/from disk). Leafs contain data entries which consist of value of the search key and the record identifier (rid = page-id, slot). Data entries are POINTERS. Root and inner ndoes have auxiliary index entries. They are height balanced.

fanout = number of children for each node. Minimum 50% occupancy fornot the root.

Indirect indexes can be in the form <key, rid> but could be in form <key, list of matching rid>

Direct indexes can contain the actual tuple, but expensive.

10.3.1Clustered vs Unclustered

Clustered
relatin in file sorted by search key attributes of the idnex
Unclustered
relation in heap file or sorted by attribute different to the search key attribute of the idnex

Can be clustered by at most one index

Diagram 21

10.3.2Examples

Relation R(A,B,C,D,F)

A and B are int (6 byte), C-F are char[40] (1 byte per char). Tuple = 172 bytes.

200,000 tuples

Each data page has 4000 bytes and is around 80% full.

Number of pages = (number of tuples * tuple size)/(fill rate * page size). in this case 10,750 pages

B values are uniformilly distributed

Non-clustered index B-tree with <k, list of rid>

Index pages are 4K and between 50%-100% fulll

rid = 10 bytes

size of pointer in intermediate page = 8 bytes

Index entry size in root/intermediate pages = size(key) + size(pointer) 6 + 8 = 14 bytes

Average number of rids per data entry = number of uples / different values (if uniform). eg: 200,000/20,000 = 10

Average length per data entry = size of key + (number of rids * size of rid). eg: 6 + 10*10 = 106

Average number of date entrier per page = fill-rate * page size / length of entry. eg 0.75*5000/106 = 28 entries per page

Estimate number of leaf page = number of different values / number of entrier per page. eg: 20,000 / 28 = 715

Number of entries in intermediate pages = fill-rate * page size / lneght of index enty. eg: 0.5 * 4000 / 15 = 143, 1*4000/14 = 285

Height of tree = Number of leaf pages / (min | max)? number of entries in intermediate pages

11Query Evaluation

11.1Processing SQL Queries:

  • parser translate into internal query representation (syntax, references existing stuff, view translation)
  • query optimizer turns the query into an efficient execution plan
  • plan executor executes the execution plan (nowai!) and delivers data

Query decomposition is variable because order can change, need to pick the best / most efficient

Access Path
method used to retrieve set of tuples from a relation

Cost models

  1. number of pages
  2. number of I/O (one page can be read from disk multiple times)

Query Processing operations

  • Selection
    • Scan without an index
    • scan with an index
  • Sorting
  • joining
    • nested loop join
    • Sort-merge join
    • etc...
  • grouping and duplicate elimination

Number of tuples in table P = CARD(P).

11.1.1Simple Selections

No index
Search on arbitrary attribute: scan the whole relation. Cost = number of pages
Search on primary key attribute: Cost = 1/2 number of pages on average
With index on selection attribtues
Cost = number of index pages + number of data pages

Clustered index number of data pages = (number of matching tuples / tuples per page)

Unclustered number of data pages = number of tuples!

11.2External Sorting

N pages, B

Pass 0
bring B pages into buffer, sort in main memory
cost N/B
Pass 1
...

Diagram 21

Diagram 22

11.3Equality Joins

  1. Simple Nested Loop join
  2. Page oriented nested loop join
  3. Block oriented nested loop join
  4. index nexted loop join
  5. Sort-Merge Join

11.3.1Nested Loop Joins

11.3.1.1Simple nest loop join

for each tuple in the outer relation P, scan the entire inner relation S

Cost = Number of outer page + number of outer tuples*(number of inner pages)

11.3.1.2Page oriented nested loop join

for each page of outer, get each page of inner and write out matching pairs.

Cost = Number of outer pages + Number of Outer Pages * Number of Inner Pages

11.3.1.3Block Oriented Nest Loop Join

If the smaller table fits in main memory + 2 extra buffer pages: load smaller into main and compare page by page on larger

Cost = Inner Pages + Outer Pages

If no relation fits in main meory:

For each block of B-2 Pages of Outer

  • load block into main memory
  • get each page of inner and join with B-2 pages of outer. write output to remaining page

Cost = Number of Outer Pages + (Number of outer pages / (number of buffer pages -2) * number of inner pages)

11.3.1.4Index Nested Loop Join

If there is an index, we can exploit this

Cost = non-index number of pages + (number of non-indexed tuples * cost of finding indexed tuple)

If clustered, then can compare all tuples in found pages (cheap). If unclustered, one page per indexed tuples

11.3.2Sort-Merge Join

Sort inner and outer on join column, then scan them to do a merge (join on that column).

Advance scan of outer until current outer-tuple >= current inner tuple, then advance scan of inner until current inner-tuple >= current P tuple, until current inner-tuple = current outer-tuple

Cost: (number of passes for sorting + 1 for merge) * (number of inner pages + number of outer pages)

11.4DISTINCT

Very expensive. Need to sort to eliminate duplicates. Done at very very end or when results are already sorted. Use with care.

11.5Pipelining

Try and avoid intermediate complete steps and perform all operations on current page/block.

12NoSQL

relational model is powerful, but very rigid. Expressive at the expense of performance.

NoSQL focuses on scalability via distribution, instead of scaling-up

Trade offs:
1. expressiveness vs performance
2. consistency vs performance
3. consistency vs availability in a distributed systems

Types of NoSQL by data model
key-value
document
extensible-record, column
graph

12.1Key-value

data item has a key and some value. Value can be complex, but transparent to DBMBS. No relationships between different key/value pairs

Interface
get(key)
put(key, value)
versioned usually so get(key, version), put(key,value, version)
Pros
scalability is easy accross many nodes
highly available
partitioned, replication, multi-versioned with weak consistency

12.2Document Store

ie: XML/JSON/ key-value entries

Interface
path expression if nexted
key searches within document colelctions
update operations on entire document

12.3Column Database

store by column first, not by record. Vertically partioned. that is, split a record by attributes, each colum stores primary key value + attribute value.