- 1 ER Model
- 2 Relational Model
- 3 Relational Algebra
- 4 Functional Dependicies
- 5 SQL
- 6 Advanced SQL
- 7 SQL Integrity
- 8 XML
- 9 Transactions
- 10 Indexing
- 11 Query Evaluation
- 12 NoSQL
1ER Model¶
- Entity
- real-world object distinguishable from other objects. An entity is described using a set of attributes
- Entity set
- a collection of similar entities
- eg: all employees
- each entity set has a key, a minimum set of attributes whose values uniquely identify an entity in the set
- each attribute has a domain
1.1ISA Hierarchies: Subclasses¶
A ISA B, then every A is also a B entity. Key is only in B
E/R Viewpoint: An entity has a component in each entity set to which it logically belongs. Its properties are the union of these entity sets
1.1.1Constraints¶
- Overlap Constraint
- can an entity be in more than one subclass?
- Covering Constraint
- must every entity of the superclass be in one of the subclasses?
1.2Relationship¶
- Relationship
- association among two or more entities
- eg: Employee A works in department X
- Relationship set
- Collection of similar relationships
- An n-ary relationship set R relates n entities set E1...En. Each relationship in R involves entites $e1 \in E1, \ldots , en \in En$
1.2.1Many-to-many¶
Each entity can have relationships with multiple entities
note: a relationship is uniquely defined by the primary keys of participating entities. That is Employee A cannot work twice in the department X
1.2.2One-to-many¶
eg: manages. One employee manages several departments. Each department is managed by only one employee
- Key Constraint
- If an entity in an entity set can only have one other entity involved in the relationship, this is a key constraint
- Depicted with an arrow from the one to the one-to-many relationship set
1.2.3One-to-one¶
Represented with key constraints in both directions
1.2.4Participation Constraints¶
- Participation Constraint
- Requires that every entity in the entity set participating in the relationship MUST have a relation
- represented with a thick line from the constrained entity set to the relationship
Can also combine key constraints with participation constraints
1.2.5Weak Entities¶
- Weak entity
- can be indentified uniquely ONLY by considering the primary key of another entity
- The key in this kind of set is the union of the keys of the owner entity set and the set of its own attribtues.
- represented with a bold rectangle
1.2.6Ternary Relationship¶
1.2.7Aggregation¶
- Aggregation
- Allows us to treat a reltionship set R as an entity set so that R can participate in other relationships
2Relational Model¶
ER is a semantic model. Relational model is a DATA model
- Relation
- Consists of a schema and an instance
- Schema
- specifies name of relation, plus a set of attributes, plus the domain/type of each attribute
- Instance
- set of all tuples
- tuples are distinct
- Database Schema
- Collection of relation schemas
A relation can be seen as a table. Column headers = attribute names, rows = tuples/records, columns/fields = attribute values.
The number of rows = cardinality. Number of fields = degree / arity
All rows are distinct. Rows are not ordered
- DDL
- Data defintion language
- defines the schema of a databse
- DML
- Data manipulation language
- manipulates the data (instances of the relations)
- eg: Insert, update, delete
- Query the relationship
- aka Query Language
2.1SQL¶
Structured Query Language (SQL)
Used to define relations and to write/query data
2.1.1Data Types¶
Name | Description |
---|---|
CHAR(n) | A character string of fixed length n |
VARCHAR(n) | denotes a string of up to n charaters |
INT or INTEGER | An integer |
SHORTINT | Shorter version of an integer |
FLOAT or REAL | floating point number |
DOUBLE PRECISION | eg: Double in java |
DECIMAL(n,d) | real number with fixed decimal pooint. Value of consits of n digits with a decimal point d positions from the right |
DATE | In the form YYYY-MM-DD |
TIME | Form is 15:00:02 or 15:00:02.5 |
Bit strings | ??? |
2.1.2Table Creation¶
CREATE TABLE Students (sid INT, name VARCHAR(20), login CHAR(10), major VARCHAR(20) DEFAULT 'undefined')
Defines all attributes of the relation and the type/domain of each.
SQL is case insensitive
2.1.3Destorying/Altering Relations¶
DROP TABLE Students
Destroys the schema AND the tuples are deleted
ALTER TABLE Students ADD COLUMN firstYear:integer
Adding a new row results in all existing tuples gettings a new field with the null value
2.1.4Insert/Delete/Update Tuples¶
Inserting a single tuple
INSERT INTO Students (sid, name, faculty) VALUES (53688, 'Chang', 'Eng')
Deleting Multiple Tuples
DELETE FROM Students WHERE name = 'Chang'
Updating Tuples
UPDATE Students SET faculty = 'Science' WHERE sid = 53688
2.1.5Integrity Constraints¶
An Integrity Constraint Must be true for any instance of the database. There are domain specific constraints that are already enforced (eg on INT)
DB System will enforce any constraint and reject a modification that violates it
2.1.5.1NOT NULL¶
CREATE TABLE Students (sid INT NOT NULL, name VARCHAR(20), login CHAR(10), gpa REAL DEFAULT 0.0)
requires an attribute to always have a proper value
2.1.5.2Primary Key Constraints¶
A set of fields is a key for a relation if:
- No two distinct tuples can have the same values of all key fields
- This is no true for any subset of the key
Any key that satisfies these properties is considered a candidate key. One is chosen to be the primary key
Primary key attributes may not be NULL
CREATE TABLE Students (sid INT NOT NULL PRIMARY KEY, name VARCHAR2(20)) CREATE TABLE Enrolled (sid INT NOT NULL, cid CHAR(20) NOT NULL, grade CHAR(2) NOT NULL, PRIMARY KEY (sid,cid), UNIQUE (cid,grade))
sid makes a good primary key in the first one. In Enrolled we want each student to have ONE grade per class!
2.1.5.3Foreign Key¶
- Foreign Key
- set of fields in one relation that is used ot "refer" to a tuple in another relation
- must correspond to the primary key of the second relation
-
represents a "logical pointer"
CREATE TABLE Enrolled
(sid INT NOT NULL,
PRIMARY KEY(sid)
FOREIGN KEY (sid)
REFERENCES Students) - Foreign Key Constraint
-
foreign key value of a tuple must represent an existing tuple in the referred relation.
Tuples that refer to nonexistent foreign keys are not allowed to be inserted.
- What happens when a refered tuple is deleted?
- Delete all tuples that reference it
- Disallow deletion as long as something refers to it
- set the foreign key to a feault value
- When the primary key is updated
- updated all tuples that reference it
- or do something similar to deletion (disallow, delete referencing tuples)
SQL default to disallowing. Can use CASCDE
eg:
CREATE TABLE Enrolled (sid INT NOT NULL, cid CHAR(8) NOT NULL, grade CHAR(2), PRIMARY KEY (sid,cid) FOREIGN KEY (sid) REFERENCES Students ON DELETE CASCADE ON UPDATE CASCADE)
Integrity Constraints are based on the semantics of the realworld. You can check that an IC holds by looking at the whole table, but you can't assume from examing one or some tuples
2.2Translating ER to R¶
2.2.1Entity Sets to Relations¶
One relation per entity set
2.2.2Many-to-Many relationship sets¶
Always translated as an individual table. Attributes for the table are the keys for each participating set (as foreign keys) and all descriptive attributes
2.2.3Key Contraints¶
If one-to-many make the one the primary key, hence guaranteeing uniqueness. Entity set and relationship each gets a table
OR
Include relationship set in table of the entity set with the key constraint as a field.
2.2.4Key Contraints and participation contraints¶
Include relationship set in table of the entity set with the constraints
2.2.5Partitipation Conatraints¶
Usually can't be done UNLESS also key constraint
2.2.6Renaming¶
Might happen if a relation involves to entities from the same set.
2.2.7Weak Entity Sets¶
Weak identity set and identifying relationship set are translated into a single table. Must cascade on deletion of primary foreign key
2.2.8ISA Hierarchies¶
2.2.8.1General approach¶
distribute information among relations. Super class defines general attributes and key. Subclasses are relations with additional attributes and superclass key. subclasses must be altered when superclass altered.
2.2.8.2object-oriented approach¶
sub-classes have all the attributes
Pro | Con |
---|---|
Query asking about just subclass entities only has one table to search | Query on attributes of all superclass entities have to traverse all tables of subclasses |
Entities that are in two subclasses are stored twice |
2.2.8.3One big relation approach¶
One big table with all the attribtues of all subclasses (can be null)
2.2.9Aggregation¶
- Option 1: No key contraints
- Basically just create a table for the aggregation relation that references the keys of all involved
- Projects(pid,started_on,pbudget)
- Departments(did,dname,budget)
- Employees(eid,name,salary)
- Sponsors(pid,did,since)
- Monitors(pid,did,eid,until)
- Option 2: Use inner relationship
- Key constraint on relationship within the aggregation and the other entity set
- eg: Sponsor(pid,did, eid, since, until)
- Option 3: Use Outter Relationship
- No Sponsors
- Project(pid, started_on, pbudget, did, since)
- Monitors(pid,eid,until)
3Relational Algebra¶
Basics
- Input
- input one or two relations
- schema of each relation is known
- instance can be arbritrary
- Output
- a relation
- schama of output relation depends on operator and input relations
Relational algebra is closed
Name | Symbol | Description |
---|---|---|
Selection | $\sigma_{condition}(R_{in})$ | Selects a subset of tuples from a relation |
Projection | $\Pi_{attribute}(R_{in})$ | Projects to a subset of attributes from a relation |
Renaming | $\rho(R_{out}(B1, \ldots, Bn), R_{in}(A1, \ldots, AN))$ | Rename relations or attribtues |
Cross product | $\times$ | Combines two relations. Each row of first input combined with each row of second input |
Join | $\bowtie$ | Combination of cross product and selection |
Division | \ | TODO |
Intersection | $\cap$ | |
Union | $\cup$ | |
Set Different | - | Tuples that are in the first, but not the second |
Operators can be composed
Set operators must be on relations with the same number of attribtues and same types
3.1Types of Joins¶
- Condition/Theta Join $R_{out} = R_{in1} \bowtie_c R_{in2} = \sigma_c(R_{in1} \times R_{in2})$
- Equi Join: $R_{out} = R_{in1} \bowtie_{Rin1a1 = Rin2b1, \ldots, Rin1an = Rin2bn} R_{in2}$ Condition join where condition contains ONLY equalities
- Natural Join: Equijoin on all common attributes
3.1.1Division¶
Let A have have 2 fields, x and y, have only field y.
$A/B = \{ x | \forall y \in B \exists x,y \in A\}$
A/B containsall x typles such athf ro every y tuplein B there is an xy tuple in A
3.1.2Rules of Relational Algebra¶
Let $R,S,T$ be relations. $C, C1, C2$ are conditions. $L, L1, L2$ are projection lists on R and S.
- Communitivity
- order of operands does not matter (except renaming, division and difference)
- $\Pi_L (\sigma_C (R)) = \sigma_C (\Pi_L (R))$
- $R1 \bowtie R2 = R2 \bowtie R1$
- Associativity
- You can move brackets around
- $R1 \bowtie (R2 \bowtie R3) = (R1 \bowtie R2) \bowtie R3$
- Idempotence
- $\Pi_{L2} (\Pi_{L1}(R)) = \Pi_{L2} (R)$ if $L2 \subset L1$
- $\sigma_{C2} (\sigma_{c1}(R)) = \sigma_{C1 \wedge C2}(R)$
4Functional Dependicies¶
4.1Types of Problems¶
Suppose we have Hourly_Emps(sin, rating, hourly_wages, hourly_worked)
Wages is related to rating
- Update Anomly: Can we change the wages in only the first tuples?
- Insertion Anamoly: What if we want to insert an employee and don't know the hourly wage for his/her rating?
- Insertion Anamoly II: We can only introduce a new rating scheme when we enter an emplyee with this rating into the relation
- Deletion Anamoly: If we delete all employees with rating 5, we lose the information about the wage for rating 5!
- Storage Redundancy
4.2Functional Dependencies¶
A functional dependency $X \to Y$ ($X$ and $Y$ are sets of attributes; $X,Y \subset$schema($R$)) holds over relation $R$ if for every allowable instance $r$ of $R$:
* Given tuples $t1$ and $t2$ in instance $r$ of $R$, if the X values of $r1$ and $t2$ agree, then the $Y$ values must also agree
* $t1 \in r, t2 \in r, \Pi_x(t1) = \Pi_x(t2) \to \Pi_y(t1) = \Pi_y(t2)$
FDs are determined by the semantics of the application.
- Key Candidates
- Let schema$(R) = (a1, a2, ... an)$; a set of attribtues $X \subset$ schema$(R)$ is a candidate key for a relation $R$ if $X \to R (= X \to a1,a2, \ldots, an)$
- That is some set of attributes of R (could be a single element set) implies all of the attributes of the relation
In the first example, we would move rating -> wage into it's own table and then just store rating!
4.3FD Axioms¶
($X,Y,Z$ are sets of attributes)
pname | Description |
---|---|
Reflexivity | If $Y \subset X$ then $X \to Y$ |
Augmentation | $X \to Y$ then $\forall Z, XZ \to YZ$ |
Transitivity | If $X \to Y, Y \to Z$ then $X \to Z$ |
Union | If $X \to Y, X \to Z$ then $X \to YZ$ |
Decomposition | If $X \to YZ$, then $X \to Y, X \to Z$ |
- Armstrong's Axioms
- Reflexivity
- Augmentation
- Transitivity
- $F^+$
- closure of F
- set of all DFs that are implied by F through Armstrong's Axioms
- Algorithm for finding the attribute closure of $X \to X^+$ wrt F:
-
Input: set of all attributes $A$ such that $X \to A$ is in $F^+$
Basis: $X^+$
Induction: If $Z\subset X^+$ and $Z \to A$ is in $F$, add $A$ to $X^$
End when $X^+$ cannot be changed - Finding all key candidates
-
X is akey candidate if X+ = R AND there is no subset Y of X such that Y+ = R
-
X has to imply everything and it has do do so minimally
4.4Normalization¶
If only key-induced FDs hold there is no redundancy in the database
4.4.1BCNF¶
Boyce-Codd Normal Form
A relation R with FDs F is BCNF if for all $X \to A$ in F+:
* $A \in X$ (trivial FDs)
* $X$ contains a key for $R$
R is in BNCF if the only non-trividal FDs that hold over R are key constraints
4.4.23NF¶
A relation R with FDs F is in 3NF (3rd Normal Form) if for all $X\ \to A$ in F+
- $A \in X$ (trivial FDs)
- X Contains a key for R (X is a superkey) or
- A is a part of some key for R
4.5Decomposition¶
If a relation is not in the desired normal form we decompose it.
Given a relation R with attributes A1, ..., An, a decomposition of r consists of replacing R by two or more relations such that:
each new scheme contains a subset of attributes of R and nothing more
every attribute of R appears as an attribute on one of the new relations
* The ruples of R are split such that corresponding subtuples appear in the enw relations
Disadvantages: some queries now have to look at more tables and are thus more expensive
- Loseless Join decompositions
- The decomposition of R into R1 and R2 is lossless-join wrt F if and only if the closure of F contains: Let {a1, ..., an} be the set of attribute sin R1 and R2: then {a1, ..., an} $\to$ R1 or {a1,..., an} $\to$ R2
- In particular if $U \to V$ (non trivial) holds over R then the decomposotion of R into UV and R - V is lossless-join (rejoin does not create new tuples because U is key in UV and foreign key in R-V)
- Denoted as a decomposition along FD $U \to V$
4.5.1Decomposing into BCNF¶
Given a relation R with FDs F and $X \to Y \in FD$ violating BCNF:
- Compute X+
- Decompose R into X+ and $(R - X+) \cup X$
- Repeat until all relations in BCNF
If $X^+ = \{X,Y\}$ we denote this as a decomposition along $X \to Y$
- Dependency Preserving Decomposition
- Given a relation R with FDs F. A depdendency preserving decomposition allows us to check each dependency in F by looking at one of the decomposed relations
- Not all BCNF decompositions are dependency preserving
4.5.2Decomposing into 3NF¶
Any relation R can be decomposed into relations that are 3NF and dependency preserving
5SQL¶
SELECT desired attribtues FROM list of relations WHERE qualification (optional)
Conversion to relational algebra
SQL | Algebra |
---|---|
SELECT | projection |
WHERE | selection and join |
Sample:
SQL statement:
SELECT rating, age FROM Skaters WHERE rating >= 10 OR age > 15
Relational Agebra:
$\Pi_{rating, age}(\sigma_{rating >= 10 \lor age > 15}(Skaters))$
The only difference is that SQL does not eliminate duplicates
5.1WHERE clause¶
- Comparision
- attribute1 op constant
- atttribute1 op attribute2
op is one of <, =, >, <>, <=, >=, LIKE
Can use arithematic operations on numbers
- Qualification/Condition
name = 'Cheng' AND NOT age = 18
name LIKE '%e_g'
%: any string, _: any character- there are other string operators like string-length, concatentation
SQL uses multiset semantic, which allows duplicates, except on primary keys
DISTINCT``` is an optional keyword that will eliminate duplicates eg: ```sql SELECT DISTINCT sname FROM Skaters
- Renaming Columns
SELECT sid, sname AS skater FROM Skaters WHERE raing < 9
- Expressions
SELECT sname, rating+1 AS updgrade FROM Skaters
- Constants
SELECT rating AS reality, '10' AS dream FROM Skaters
- Will add colum dream with value 10 to output
- Ordered Output
SELECT * FROM Skaters ORDER BY age, rating
- will wort by age THEN rating
5.2Multirelational Queries¶
You can have MULTIPLE relations in the FROM clause. Relation.Attribute is used to disambiguate
- Cross Product
SELECT * FROM Skaters, Participates
- Join
SELECT sname FROM Skaters, Participates WHERE Skaters.sid = Participates.sid
- OR
SELECT sname FROM Skaters JOIN Participates ON Skaters.sid = Participates.sid
5.2.1Union, Intersection, Difference¶
Input relations for set operators must be set-comaptible. Result relation is a set (not a multiset)
5.2.1.1Union¶
Find skaters that have participated in a regional or a local competition
SELECT P.sid FROM Participates P, Competition C WHERE P.cid = C.cid AND C.type = 'local' UNION Select P.sid FROM Participates P, Competition C WHERE P.cid = C.cid AND C.type = 'regional'
Multiset version: UNION ALL
5.2.1.2Intersection¶
Find skaters that have participated in a regional AND a local competition
SELECT P.sid FROM Participates P, Competition C WHERE P.cid = C.cid AND C.type = 'local' INTERSECT Select P.sid FROM Participates P, Competition C WHERE P.cid = C.cid AND C.type = 'regional'
Multiset Version: Intersect ALL
5.2.1.3Diference¶
Find skaters that have participated in a local but not in a regional
SELECT P.sid FROM Participates P, Competition C WHERE P.cid = C.cid AND C.type = 'local' EXCEPT Select P.sid FROM Participates P, Competition C WHERE P.cid = C.cid AND C.type = 'regional'
Multiset version: EXCEPT ALL
5.2.2Nested Queries (IN)¶
A where clause and contain an SQL query which is called a subquery
Find names of skaters who ahve participated in competition #103
SELECT S.sname FROM Skaters S WHERE S.sid IN (SELECT P.sid FROM Participates P WHERE P.cid = 103)
Can invert selection using NOT IN
Basical it is WHERE (a1, a2) in (SELECT a3, a4...)
5.2.3Exists¶
EXISTS is TRUE iff the relation is non-empty
Find names of skaters who have participated in competion 103
SELECT S.sname FROM Skaters S WHERE EXISTS (SELECT * FROM Participates P WHERE P.cid = 103 AND P.sid = S.sid)
5.2.4Quantifiers¶
Can use ANY and ALL (like universal and existential quantifiers)
- Syntax
WHERE attr op ANY (Select ...)
WHERE attr op ALL (SELECT ...)
Find highested rated skater
SELECT * FROM SKATERS WHERE rating >= ALL (SELECT rating FROM Skaters)
5.2.5Division¶
Find skaters who have reserved all boats
SELECT sname FROM Skaters S WHERE NOT EXISTS ((SELECT C.cid FROM Competition C) EXCEPT (SELECT P.cid FROM Participates P WHERE P.sid=S.sid)) SELECT sname FROM Skaters S WHERE NOT EXISTS (SELECT C.cid FROM Competition C WHERE NOT EXISTS (SELECT P.cid FROM Participates P WHERE P.cid = C.cid AND P.sid = S.sid))
6Advanced SQL¶
6.1Aggregation¶
- Counting
SELECT count(*) FROM Skaters
- Averages
SELECT avg(age) FROM Skaters WHERE rating = 10
- Distinct
SELECT count(DISTINCT sid) FROM Participates
Also, SUM, MAX, MIN
6.1.1Grouping¶
SELECT target-list FROM relation list WHERE qualification GROUP BY grouping listn
A group is a set of tuples that ahve the same value for all attributes in the grouping list. One answer tuples is generated per group. Target-liust contains attributes and/or aggregation terms
eg:
SELECT rating, avg(age) FROM Skaters GROUP BY rating
If any aggregation is used, then each element in the attribute list of the SELECT clause must either be aggregated or appear in a group-by clause
6.1.1.1HAVING clause¶
These are just WHERE claues on GROUP BY
eg: Find average age of skaters with at least two skaters at each rating level
SELECT rating, avg(age) FROM Skaters GROUP BY rating HAVING COUNT(*) >= 2
Aggregate operations cannot be nested!
6.2Views¶
A view is an umaterialized relation: we store a definition rather than tuples, and get the tuples from existing tables
CREATE VIEW ActiveSkaters (sid, sname) AS SELECT DISTINCT S.sid, S.sname FROM Skaters S, Participates P WHERE S.sid = P.sid
6.3NULL Values¶
Results in issues for comparisons and arithemtic. Need 3-valued logic (true, false, unknown)
- NOT unknown = unknown
- A or B = true if either A = true of B = true
- A or B = false if A = false and B = false
- A or B = unknow if A = false and B = unknown, or vice versa, or A = B = unknown
- A and B = true if A = true and B = true
- A and B = false if either A or B = false
- A and B = unknown if A = true and B = unknown, or vice versa, or A = B = unknown
In SQL the WHERE clause eliminates rows that are not true (ie: false or unknown)
6.4Out Join¶
Same as an inner join, but with dangling tuples paadded with NULL
A regular join on Skaters and Participates only includes result tuples that are in both Skaters and Participates.
An (left) outer join on Skaters (left) and Participates includes Skater tuples that weren't in Participates and pads the missing information with nulls
SELECT S.sid, Sname, P.cid FROM Skaters S LEFT OUTER JOIN Participates P ON S.sid = P.sid
6.5Levels of Abstraction¶
A single conceptual/Logical Schema defines the logical strucre. Conceptual db design
A Physical Schema describes the files and indexes used
Views describe how users see the data
Physical Data Independence: The conceptual schema protects from changes in the physical structure of data
Logical data independence: External schema proctects from changes in conceptual schema of data
6.6DB Modification¶
6.6.1Inserts¶
INSERT INTO SKATERS VALUES (68, 'Jacky, 10, 10)
OR
INSERT INTO Skaters (sid, name) VALUES (68, 'Jacky')
Inserting the result of a query
INSERT INTO ActiveSkaters ( SELECT Skaters.sid Skaters.name FROM Skaters, Participates WHERE Skaters.sid = Participates.sid)
6.6.2Deletion¶
DELETE FROM Competitions WHERE cid = 103
or
DELETE FROM Competition
Will delete EVERYTHING
6.6.3Updates¶
UPDATE Skaters SET ranking = 10, age = age + 1 WHERE name = 'debby' OR name = 'lilly'
7SQL Integrity¶
7.1Constraints¶
- Domain Constraints (data type, UNIQUE, NOT NULL)
- Primary Key Constraints
- Foreign Key Constraints
- Attributes and tuple based checks
- SQL2 Assertions: global constraints (just check that conditions aren't violated, rejection only behaviour)
- SQL3 Triggers: upon user modification, checks are performed and reactions specified
7.1.1Attribute Based Checks¶
CREATE TABLE Skaters ( sid INTEGER PRIMARY KEY NOT NULL, sname VARCHAR(20), rating INTEGER CHECK(rating > 0 AND rating < 11), age INTEGER)
Condition is check when the attribute is changed. Condition violations result in rejections. Can be anything that would follow a WHERE clause including subqueries
eg: sname VARCHAR CHECK (sname NOT IN (SELECT name FROM forbidden))
DB2 does not allow subqueries or refences to other attributes...
7.1.2Tuples Based Checks¶
CREATE TABLE Skaters ( sid INTEGER PRIMARY KEY NOT NULL, sname VARCHAR(20), rating INTEGER, age INTEGER CHECK (rating <= OR age > 5))
Checked on update and insert
7.1.2.1Naming Constraints¶
May need to add or delete constraints laters. So we can NAME THEM
CREATE TABLE Skaters ( sid INTEGER NOT NULL, sname VARCHAR(20), rating INTEGER CONSTRAINT rat CHECK(rating > 0 AND rating < 11), age INTEGER CONSTRAINT pk PRIMARY KEY (sid), CONSTRAINT ratage CHECK (rating <= 4 OR age >5))
Allows us to do ALTER TABLE Skaters DROP CONSTRAINT ratage
or ALTER TABLE Skaters ADD CONSTRAINT rateage CHECK (rating <= 5 OR age > 5)
7.1.3Assertions¶
predicate expressing a condition that we wish the database to always satisfy and what to do when it fails
CREAT ASSETION smallClub CHECK ((SELECT COUNT(*) FROM SKATERS) + (SELECT COUNT(*) FROM Competitions) < 100)
Check whenever any relations involved are modified
7.1.4When and How much?¶
Attribute based checks happen on inserts and upates of that just one tuples. Tuple based checks same thing (cheap)
Assertion, insert or update or delte of tuples involved in any relation (expensive)
7.2Triggers¶
- Trigger
- Prodcued that starts automatically if specified changes occur to the DBMS
Made up of three parts
- Event (activates the trigger): usually insert/update/delete
- Condition (test whether the trigger should run)
- Actions (what happens if the trigger runs)
7.2.1Statement Level Trigger¶
CREATE TRIGGER updateSkater AFTER DELETE ON Skaters REFERENCING NEW_TABLE AS DeletedSkaters FOR EACH STATEMENT INSERT INTO StatisticsTable(ModTable, ModType, Count) SELECT 'Skaters', 'delete', COUNT(*) FROM DeletedSkaters
7.2.2Row Level Trigger¶
CREATE TRIGGER ratingIncrease AFTER UPDATE OF rating on Skaters REFERENCING OLD AS o NEW AS n FOR EACH ROW WHEN (n.rating > 1 + o.rating) UPDATE Skaters SET skating = 1 + o.rating WHERE sid = n.sid
7.2.3Trigger Components¶
When?
- AFTER UPDATE OF
- BEFORE UPDATE OF
- INSTEAD OF UPDATE
(could be UPDATE, INSERT, DELETE)
Scopes?
- Row level trigger: references OLD/ NEW
- Statement level: OLD_TABLE / NEW_TABLE
Indicated in the RERFENCING command
Can execute SEVERAL actions
FOR EACH ROW/STATEMENT WHEN ... BEGIN ATOMIC do thing1 do thing2 END
Can also set SQL States. eg: SIGNAL SQLSTATE '75000' SET MESSAGE_TEXT='Rating increase must be by one'
8XML¶
XML is built with elements
- Elements components
- Name
- Text (eg: foundations)
- Child elements (nesting)
- Attribtues (included in starttag)
Attribtues are made of names and values eg: <p id="first one">
8.1Tree model¶
Eache element is a node. Nested elements build a path. The leaves are attributes, text or just elements with neither. Element siblings are ordered
8.2Graph Model¶
Each element can have an ID attribute allowing elements to references other elements. This produces a graph
8.3DTD¶
A valid xml documents conforms to a Document Type Description (DTD) although they are optional. Has a BNF-life grammar which established constraints on element structure, attributes and content
Sample:
<!DOCTYPE people[ <!ELEMENT people(person*)> <!ELEMENT person(name*, (lastname|familyname)?)> <!ATTLIST person PID ID #REQUIRED age CDATA #IMPLIED children IDREFS #IMPLIED mother IDREF #IMPLIED > <!ELEMENT name(#PCDATA)> <!ELEMENT lastname(#PCDATA)> <!ELEMENT familyname (#PCDATA)> ]>
Data types: PCDATA (parsed character data) or CDATA (unparsed)
- Attributes
- ID unique identifier (similar to primary key)
- IDREF: reference to single ID
- IDREFS: space-seperated list of references
- Values
- can give a default value
- #REQUIRED must exist
- #IMPLIED optional
Specified in an XML file with <!DOCTYPE name SYSTEM "path/to/thing.dtd">
Can use regex style things too. * is 0 or more. + is 1 or more, (a | b)? is one or the other
8.4XPath¶
Specificy a path expresion to match XML data by navigating down (or accross) the XML tree. returns a node-set
- Examples
- /bibliography/book/author all author elements by root navigating through those elements
- /bibliography/book/@ISBN All ISBN attributes
- //title all title elements anywhere in the document
- /bibliography/*/title titles of bibliography entries assuming that there could be books, journals, reports, etc...
- /bibliography/book[@year>1995] returns books where the year > 1995
- /bibliography/book[author='FooBar']/@Year returns the years of books written by FooBar
- /bibliography/book[count(author) <2]
- /bibliography/book/author[position()=1]/name position is the location of the node in the node set
8.5XQuery¶
FLOWR = For/let-where-orderby/return
For vs Let
For binds node variables for iteration
Let binds collection variables for one value
VS
Aggregation
let $a = avg(document("bib.xml")/bib/book/price) for $b in document("bib.xml")/bib/book where $b/price > $a return $b
Can use joins with where $b1/author = $b2/author
subqueries, sorting, if-then-else, existential and universally quantified
eg: ```where some $b in $a/book satisfies count($b/author) > 10$
8.6XML in DB2¶
Storage as extra attribute CREATE TABLE MyXml (id INT NOT NULL PRIMARY KEY, Info XML)
Insert with:
INSERT INTO MyXML(id, INFO) VALUES (1000, '<customerinfo cid="1000"> <name>Kathy Jones</name> <addr country =Canada"> <street>123 fake</street> <city>Ottawa</city> <prov-state>Ontario</prov-state> <pcode-zip>H0H 0H0</pcode-zip> </addr> </customerinfo>')
8.6.1Querying with XQuery¶
db2-fin:xmlcolumn
XQUERY for $d in db2-fn:xmlcolum('MYXML.INFO')/customerinfo where $d/addr[city="Toronto"] return $d/name
SELECT and Xquery
SELECT XMLQUERY ( 'for $d in $doc/customerinfo return $d/name' passing INFO as "doc") FROM MyXml m WHERE XMLEXISTS( '$i/customerinfo/addr[city="Toronto"]' passing m.info as "i")
9Transactions¶
Imagine a money transfer Transfer(id1, id2, value)
SQL
SELECT balance into :balance FROM accounts WHERE accoundid = :id1 balance += value UPDATE accounts SET balance = :balance WHERE accountid = :id1 SELECT balance into :balance FROM accounts WHERE accountid = :id2 balance -= value UDPATE accounts SET balance = :balance WHERE accountid = :id2
- read(A)
- A := A + 50
- write(A)
- read(B)
- B := B - 50
- write(B)
A transaction is the DBMS'sabstract view of a user program. It is a sequence of read operations, r(x) and write operations w(X) on objects x (tuple, relation) of the db
Read: bring obkect into main memory from disk, send variable to application
Write: bring object into main memory from disk and modify it. Write it back to disk (might be done later)
9.1ACID¶
- ACID
- Atomicity: A transaction is atomic if it executed in it entirety or not at all
- Consistency: A transaction must preserve the consistency of the data
- Isolation: In case that transaction is executed concurrently, the result must be the same as if they were executed serially
- Durability: The changes made by a transaction must be permanent
- Consistency
- Each transaction must leave the database if the DB is consistent before the tranasction started. Hard to enforce at the db level unless using integrity constraints
- Isolation
- DBMS can execute transactions concurrently. Enforced by a concurrency control protocol.
- Atomicity
- A transaction either commits or aborts
- In the case of an abort then you have to undo all the operations done
- Local Recovery ensures that partial results can be done. Writes keep a pre-image
- Durability
- There must be a guarantee that changees introduced by a transaction will surive failures
In the event of a crash we have to do Global Recovery. Involves ensuring commited transactions are in the db, that aborted changes are undone and that tranasctions at the time of the crash are undone.
9.2Buffers and the Disk¶
When are changes written from main memory to the disk?
9.2.1FORCE flush¶
Force all changes of transaction are flushed to the disk BEFORE commit. Eg writes are done straight to disk
Easy recovery for commited or aborted tranasctions, but need cleverness to deal with inprogress transactions
9.2.2NOSTEAL flush¶
Changes are never flushed before a commit. Once committed, changes can be flushed at leasure. Inprogress and aborted transactionsa are easyt to deal with, nothing on disk. Need to deal with commited but potentially not written transactions.
Ideally we would flush at commit time, but this isn't possible due to I/O not being atomic.
10Indexing¶
Cost model for data access. We just oncsider number of disk reads.
- Types of opertions
- scan over all records
SELECT * FROM Skaters
- Equality search
SELECT * FROM Skaters WHERE sid = 100
- Range Search
SELECT * FROM Skaters WEHRE age > 5 and age <= 10
- Insert
INSERT INTO Skaters VALUES (23, 'lilly, 10, 8)
- Delete could be equality or range
- udpate is just delete + insert
10.1Heap Files¶
Linked, unordered list of all pages of the file
- scan retriving all records = good because all pages retrieved anyway
- equality search on primary key = bad have to read ~1/2 of pages
- range search = bad
- insert = good, low cost
- delete/update same as search
10.2Sorted Files¶
Records ordered according to one or more attributes of the relation
- scan retriving all records = good
- equality search on sort attribute = good about log2(number of pages)
- range search = good find first one, then scan along
- insert = bad, have to find proper page, need to reorganize
- delete/update same as search. update may be expensive
- sorted output: good if on sorted attribute
10.3Indexes¶
Sorted files can only be ordered by one attribute. Solution is to build an index for any attribute that is frequently queried. Created on a search key (different from primary key)
CREATE INDEX name ON Skaters(sid) DROP INDEX name
B+tree is the most widely used index. Each ndoe represents on page (the transfer unit to/from disk). Leafs contain data entries which consist of value of the search key and the record identifier (rid = page-id, slot). Data entries are POINTERS. Root and inner ndoes have auxiliary index entries. They are height balanced.
fanout = number of children for each node. Minimum 50% occupancy fornot the root.
Indirect indexes can be in the form <key, rid> but could be in form <key, list of matching rid>
Direct indexes can contain the actual tuple, but expensive.
10.3.1Clustered vs Unclustered¶
- Clustered
- relatin in file sorted by search key attributes of the idnex
- Unclustered
- relation in heap file or sorted by attribute different to the search key attribute of the idnex
Can be clustered by at most one index
10.3.2Examples¶
Relation R(A,B,C,D,F)
A and B are int (6 byte), C-F are char[40] (1 byte per char). Tuple = 172 bytes.
200,000 tuples
Each data page has 4000 bytes and is around 80% full.
Number of pages = (number of tuples * tuple size)/(fill rate * page size). in this case 10,750 pages
B values are uniformilly distributed
Non-clustered index B-tree with <k, list of rid>
Index pages are 4K and between 50%-100% fulll
rid = 10 bytes
size of pointer in intermediate page = 8 bytes
Index entry size in root/intermediate pages = size(key) + size(pointer) 6 + 8 = 14 bytes
Average number of rids per data entry = number of uples / different values (if uniform). eg: 200,000/20,000 = 10
Average length per data entry = size of key + (number of rids * size of rid). eg: 6 + 10*10 = 106
Average number of date entrier per page = fill-rate * page size / length of entry. eg 0.75*5000/106 = 28 entries per page
Estimate number of leaf page = number of different values / number of entrier per page. eg: 20,000 / 28 = 715
Number of entries in intermediate pages = fill-rate * page size / lneght of index enty. eg: 0.5 * 4000 / 15 = 143, 1*4000/14 = 285
Height of tree = Number of leaf pages / (min | max)? number of entries in intermediate pages
11Query Evaluation¶
11.1Processing SQL Queries:¶
- parser translate into internal query representation (syntax, references existing stuff, view translation)
- query optimizer turns the query into an efficient execution plan
- plan executor executes the execution plan (nowai!) and delivers data
Query decomposition is variable because order can change, need to pick the best / most efficient
- Access Path
- method used to retrieve set of tuples from a relation
Cost models
- number of pages
- number of I/O (one page can be read from disk multiple times)
Query Processing operations
- Selection
- Scan without an index
- scan with an index
- Sorting
- joining
- nested loop join
- Sort-merge join
- etc...
- grouping and duplicate elimination
Number of tuples in table P = CARD(P).
11.1.1Simple Selections¶
- No index
- Search on arbitrary attribute: scan the whole relation. Cost = number of pages
- Search on primary key attribute: Cost = 1/2 number of pages on average
- With index on selection attribtues
- Cost = number of index pages + number of data pages
Clustered index number of data pages = (number of matching tuples / tuples per page)
Unclustered number of data pages = number of tuples!
11.2External Sorting¶
N pages, B
- Pass 0
- bring B pages into buffer, sort in main memory
- cost N/B
- Pass 1
- ...
11.3Equality Joins¶
- Simple Nested Loop join
- Page oriented nested loop join
- Block oriented nested loop join
- index nexted loop join
- Sort-Merge Join
11.3.1Nested Loop Joins¶
11.3.1.1Simple nest loop join¶
for each tuple in the outer relation P, scan the entire inner relation S
Cost = Number of outer page + number of outer tuples*(number of inner pages)
11.3.1.2Page oriented nested loop join¶
for each page of outer, get each page of inner and write out matching pairs.
Cost = Number of outer pages + Number of Outer Pages * Number of Inner Pages
11.3.1.3Block Oriented Nest Loop Join¶
If the smaller table fits in main memory + 2 extra buffer pages: load smaller into main and compare page by page on larger
Cost = Inner Pages + Outer Pages
If no relation fits in main meory:
For each block of B-2 Pages of Outer
- load block into main memory
- get each page of inner and join with B-2 pages of outer. write output to remaining page
Cost = Number of Outer Pages + (Number of outer pages / (number of buffer pages -2) * number of inner pages)
11.3.1.4Index Nested Loop Join¶
If there is an index, we can exploit this
Cost = non-index number of pages + (number of non-indexed tuples * cost of finding indexed tuple)
If clustered, then can compare all tuples in found pages (cheap). If unclustered, one page per indexed tuples
11.3.2Sort-Merge Join¶
Sort inner and outer on join column, then scan them to do a merge (join on that column).
Advance scan of outer until current outer-tuple >= current inner tuple, then advance scan of inner until current inner-tuple >= current P tuple, until current inner-tuple = current outer-tuple
Cost: (number of passes for sorting + 1 for merge) * (number of inner pages + number of outer pages)
11.4DISTINCT¶
Very expensive. Need to sort to eliminate duplicates. Done at very very end or when results are already sorted. Use with care.
11.5Pipelining¶
Try and avoid intermediate complete steps and perform all operations on current page/block.
12NoSQL¶
relational model is powerful, but very rigid. Expressive at the expense of performance.
NoSQL focuses on scalability via distribution, instead of scaling-up
Trade offs:
1. expressiveness vs performance
2. consistency vs performance
3. consistency vs availability in a distributed systems
- Types of NoSQL by data model
- key-value
- document
- extensible-record, column
- graph
12.1Key-value¶
data item has a key and some value. Value can be complex, but transparent to DBMBS. No relationships between different key/value pairs
- Interface
- get(key)
- put(key, value)
- versioned usually so get(key, version), put(key,value, version)
- Pros
- scalability is easy accross many nodes
- highly available
- partitioned, replication, multi-versioned with weak consistency
12.2Document Store¶
ie: XML/JSON/ key-value entries
- Interface
- path expression if nexted
- key searches within document colelctions
- update operations on entire document
12.3Column Database¶
store by column first, not by record. Vertically partioned. that is, split a record by attributes, each colum stores primary key value + attribute value.