A synonym for information. A data about you, for instance, would be your name, age, height, weight, occupation, address, contact number, and plenty of other things. All this is information, or data about you. This book’s data include no. of pages, author, publisher, book id, price, etc.
What is a database?
A database is collection of interrelated data organized in such a way so that it is easy to access, retrieve, manage and understand data. Say, for example, a list of students - you will have a list of the names, along with their roll nos. You may also have a table somewhere about the marks of each student on your list. There will be separate files for mark lists for different subjects, and courses. These mark lists may be queried to find the aggregate marks, and overall percentages, which may be stored in some result files. All this collective information, derived from raw data, is called a database.
So, what are database systems?
Database Systems are general-purpose software or applications that help manage a database. They are a collection of programs, each one of which is responsible for individual tasks such as updating the database, deleting some contents from the database, adding data to the database, etc. Manually organizing data is a very cumbersome process. It is time-taking, expensive, and very tedious. Imagine having to calculate results of thousands of students, belonging to several different classes, tens of subjects in each one of them. You need software to make your life easier.
Characteristics of a Database System
Following are some characteristics of a database system. Some points have been further elaborated in the next sections in the comparison of modern database systems and traditional file systems. Traditional file systems are simply a collection of files that form the database and specific application programs that are relevant to that database. In the following sections we cover the basic characteristics of a modern day database system and then compare the traditional file systems with their new age counterparts.
- Comparable to real world: Database is stored in a way comparable to real world entities. There are various ways to represent real world scenarios and the complex relationships using data models which we will read in the coming up sections, for example object-oriented data model, the relational data model.
- Extensibility: It is possible to extend the database to accommodate changes in requirements, making the project more scalable.
- Prevent redundancy: Redundancy, or repetition of the same data in different storage locations is eliminated. This also brings to the fore the concept of centralized data, wherein all data is stored in one central location, so that there is one universal copy for all. This also eliminates the error produced when any change occurs in the data, since only one copy is updated, and all programs access the same single updated copy of data.
- Sharable data: It is convenient to share data between different users as well as applications for better performance. This property is extremely useful in multi-user systems where sharing of data is an integral component.
- Security: Protective measures should be taken to prevent any unauthorized or accidental access to or modification of database. This could involve use of security keys and authentication codes, unique to individual systems.
- Consistency: For the database to be reliable, data is maintained to be consistent. Any changes in the stored data are reflected throughout all applications using the information.
- Data insulation: The program and data are two different components of a database system. They are not dependent on each other, and any change in one does not affect the other. This is also termed as data-program independence.
- Multiple views: Different users are provided with different views as per their requirements and access privileges. A view is a subset of the complete database, providing only the information relevant to the viewer.
- Metadata: Metadata is data about data. All information about the database, including storage information, structure of the database, description of all types of constraints used.
- Integrity: Integrity is maintained in a database by enforcing several integrity constraints, which ensure that the database always contains accurate data.
- Fault tolerance and recovery: In case of any failure, whether a hardware fault or a software crash, the database should not be damaged. The system should ensure automatic recovery of all data.
- Ease of data retrieval: Data is more easily accessible, and with the help of modern query languages, it is easier to retrieve particular data from huge databases. This eliminates the need for writing new application programs from scratch for every new query.
- Concurrency: Multiple users can access same data without crashing the system. This feature extends to multiple transactions taking place simultaneously. Concurrency control systems provide a way to effectively carry out multiple transactions, though there are still measures to control the level of concurrency.
- ACID properties: These are a set of 4 properties – Atomicity, Consistency, Isolation and Durability. These properties are studied in detail later in this chapter.
- Structured and unstructured data: it is possible to store all types of data – structured, unstructured, and semi-structured data. Data in different formats, with different file sizes and properties can be easily accommodated in our database. As we will read in subsequent chapters, database systems are very much essential in analyzing Big Data.
Traditional file systems
Before the advent of modern general purpose database systems, traditional file systems were used to store, manipulate, retrieve and delete data. These file systems comprised of two major components: the data stored as a collection of files, and several application programs that accessed and manipulated the files. These systems were very complicated because of a variety of reasons. Data was not centralized, meaning there could be two or more copies of same data, resulting in redundancy. Redundancy would lead to difficulty in maintaining consistency throughout the system. Often, application programs were targeted to achieve some of the major queries, such as adding new tuples, removing some data, manipulating existing data, etc. However, for each new query, either a new application had to be written, or some entirely different programs were to be clubbed together in a complex manner to achieve the desired result.
As technology improved, more and more advances were made in the existing database systems. Progress was made towards more efficient systems, more application programs were added to generalize individual systems, among other changes. This resulted in different types of database systems, as follows:
- Single file or flat file database system: This is the traditional file based data storage system, consisting of collection of files and related application programs.
- Hierarchical system: Data is stored in a hierarchical format, with parent entities at the top, linked to its child entities. As the name suggests, hierarchical system enforces a tree like structure, with the only links possible between parent and its children. There is no link between siblings, which may or may not be a disadvantage, based on the user requirements. The implementation is pretty straightforward, and hence it is highly recommended for storing and managing structured data. An example approach is using XML (eXtensible Markup Language).
- Networked system: A network system allows for more interconnections between data, that is when many-to-many relationships exist between records, network systems should be used. It is more efficient and useful for large scale projects, such as maintaining company records, where records may not necessarily be stored in a hierarchy. Network based DBMS are useful for managing structured data. High level languages such as Pascal, Cobol, C++ are used to implement network – based database systems.
- Relational database system: This type of system is one of the most popular systems in use today. It emphasizes the importance of relations between records and entities. This paves way for more flexibility and more extensibility. Data is stored in tabular form, and relationships are clearly defined between each entity or table. Such DBMS are also called as RDBMS. Several popular RDBMS are Oracle, MS Access, SQL.
- Object-oriented approach: This is also one of the very popular systems, and the most modern one. Classes and objects are used to describe real world entities. It has a high significance in today’s world, where unstructured and semi – structured data have become mainstream. Object-oriented DBMS are comprised of objects and their behavior. Objects are implemented with the help of a variety of data structures, and this ability to store information of all types and in all formats is what makes this approach attractive. Behavior of objects is implemented with the help of sub – programs called methods, or subroutines, functions etc. One major disadvantage is the cost, as these systems can be expensive and are suitable for large scale projects. These can be implemented using high level languages such as Java, C++.
Some examples of general purpose database systems are MySQL, MS Access, SQL. It is possible for programmers to build their own database system for specific requirements.
Why is a database system required at all?
Traditionally, information was stored in a number of files, which were saved on disks. File systems are still feasible for storing a certain amount of information, however, you need a database system if you need to constantly updating large amounts of data, or you want to eliminate redundancy. Following are some major advantages of database systems over traditional file systems (Note, when referring to database systems, one refers to the modern database systems such as Oracle, unless stated otherwise). Some of the points may be a repetition of the characteristics, but are given here for comparison.
- Centralized data for all purposes. This also helps in removing redundancy from your database. Redundancy means storing the same data repeatedly in several files, thereby increasing storage costs. For example, with a database system You do not need to maintain 20 student lists – one for each subject, instead you can have one central list for all subjects, accessible to and modifiable by all subject teachers.
- Consistency. Having redundant data may also lead to inconsistency. If you have several copies of same data, any change you make in one may not be reflected in the other. In traditional files systems, database is handled by its own collection application programs, and all programs need to be changed. There exists probability of database being inconsistent due to affecting multiple programs. However, in modern database systems, change in data does not affect the system. This is also termed as program-data independence. Your data becomes highly inaccurate and is no longer reliable. Take a situation where a student changes his address or drops out. Data now needs to be updated in several different files, some of which may be linked together. With a single central database, one only need to change the data at one place, it will automatically be reflected in others.
- It is easier to store structured and unstructured information together using modern files systems. It is way more difficult to store vast amounts of unrelated data consisting of files in different formats together. This type of data is also today popularly called as Big Data. Your Facebook data, for example, consists of a variety of data forms: images (in all formats), docs, textual content, videos, map location, etc. All this data is interconnected with accounts of other people, pages, and numerous other entities. Imagine having to maintain several different application programs for each and every activity, different kinds of posts, and all kinds of interconnections between people.
- Database integrity, meaning maintaining the validity and accuracy of data. Database must satisfy all the integrity constraints, otherwise it becomes unreliable for use. Constraints can be thought of as rules that should be followed by one or more data components. For example, the amount to be withdrawn from a bank should never exceed the amount present in bank account. In file systems, such constraints are enforced with the help of application programs. We have already discussed how any change to the database involves changing all involved application programs. Here again, if we need to introduce a new constraint, such as, no person can open an account with less than $1000, or if we wish to relax an existing constraint, for example, extending certain loans to students, we will have to change all related application programs. This is a very costly business, with several overheads and issues. However, with most database system softwares, it is easy to add new constraints, modify existing ones, and even deleting some constraints, without having to worry about its impact on your database. Integrity constraints are a useful way of ensuring that the database remains intact in case of any modification to it. It is a way of guaranteeing that the database will not be affected when any changes are made to the system.
- Security. Not everyone should be allowed access to all information in your database. A customer has access to only his bank account, and nothing else. An employee cannot perform any unauthorized modifications to user’s account data, even though he knows how data is stored on servers. The application developer need only know the program specifics, nothing else. Therefore, enforcing security in our database system is an essential step to secure the database. Almost every database management system implements certain security features to protect user data. Database security means protecting data from unauthorized or accidental access or modification of data. Various authorization and permission granting techniques are applied to enforce security, either by using passphrases to authorize access to certain parts of the database, or giving read-only, read-write etc permissions to users.
- Concurrency control. What would happen if several users were to access and modify the same file at the same time? The system would probably crash, or application programs may have been written to prevent multiple users from accessing same data at the same time. To improve efficiency of the system and reduce time costs, database systems have in-built features to provide concurrency control, that is, control of data in case of concurrent transactions. Transactions are any query on the database, accessing and/or modifying parts of the database. Suppose two users simultaneously try to write data to a file. The last write saved will be the one actually written to the file after both the transactions (write commands) have been completed. This change in data may not be accurate/representative of the actual intentions of both the users, as the actual output depends on the order of execution of both the transactions. In such cases, database systems use various protocols to decide a safe sequence or order of transactions.
- Fault tolerance. Database systems have high fault tolerance. By fault tolerance, we mean that the integrity of the database is maintained in case of any software or hardware failure. To ensure that the database is unharmed, any change to it at the time of failure should not be partial. Any transaction should either be completely executed, or not executed at all. Partial transactions render the database inaccurate and unsafe. This property of enforcing that all transactions are either completely executed or not executed at all is called atomicity.
- Ease of data retrieval. File systems are based on storing data in form of files and developing specific application programs to handle queries related to the database. This makes it almost impossible to introduce a new query. For this, a new application program would have to be written that accesses all the necessary files and supports that one particular query. Modern day database systems use special query languages which can search the organized database for any type of queries. There are limitations, however, it is more easier than writing a new application program for each new query.
Database Users
Database users include all persons accessing, or having access to the database. The users vary in their access privileges, their responsibilities and role. Following is a list of different roles of people accessing the database, along with their responsibilities:
- Database Administrator: The Administrator, or the admin is the person responsible for authorizing access and share controls to other users, managing resources and handling security related activities. He administers all activities related to the database. All metadata and storage definitions are managed by the admin. His primary goal is to ensure a smooth experience for the system’s end-users.
- Application programmers: As we have already seen, a major component of database systems are application programs that help manage data, manipulate, access, retrieve data. These application programs are developed by application programmers. There major responsibility is to develop efficient, time – saving, cost – effective programs that satisfy all requirements of the end – user. They use tools and software to develop interfaces for users.
- Database designer: Their main role is to design the structure of the database, more importantly how data is going to be stored, defining integrity and other types of constraints, designing views for multiple users.
- System Analyst/Software engineer: Their primary responsibility is to document user and customer requirements in SRS docs, develop software specifications for the product .
- Naïve users: These are occasional users of the DBMS, they typically use the interfaces and application programs that have already been developed. For example, a customer who wishes to check his bank account balance.
- Sophisticated users: They use query processing languages to carry out their own transactions or queries on the database. They may use several data mining tools for this purpose. Examples include data analysts.
- Specialized users: These users write their own software to interact with the DBMS for their complex requirements. Examples include scientists, engineers.
- Standalone users: These users maintain their own personal database with the help of existing as well as ready-to-use database systems. They are the designers, administrators and the users as well. Examples include a small business owner who makes use of Excel sheets to monitor expenses etc.
Three-tier architecture of DBMS
Database systems have involved from one-tier architecture to two and three-tier architecture. An n-tier architecture divides the databases system into n modules, where each module has its own functionality and is visible to its own set of users. A one-tier architecture consists of mainly the database and does not supply any functionality to the user. This kind of architecture is useful for database designers and programmers. The two-tier architecture comprised of client-end and server-end. At the client-end were users of the database applications, who interacted directly with the system at the server-end. A two-tier architecture is typically used by programmers. Examples include ODBC, JDBC where users wrote application programs that made direct calls to the database. In three-tier architecture, this direct connection between users and the system has been removed. There is an intermediate interface between them, and the client-side is now effectively only a front-end view for the users. The three-tier architecture is the most popular and widely used in the present industry. The three tiers are:
- The data tier or internal level: This is where our database resides. At this level, description of how data is stored, the query languages associated, managing relations between data is given. Also, various constraints are defined here along with the metadata. This level is accessible to database designers as well as administrators.
- Application tier or conceptual/logical level: This is an interface between users and raw database. Data abstraction is the key to this tier – users are not aware of any details of the database further than this level, which provides them a minimal abstract view of the entire database. It acts as an insulation between users and database. Users can access database only through applications defined at this level. This level is for application programmers and sophisticated users.
- User/Presentation tier or external/view level: This is the level visible to database end users. There are multiple views of this same level, each view pertains to its precise set of users. These multiple views are generated by application programs that decide what information is to be presented and what abstracted for each user. This tier is typically for naïve end users, who do not require access to the entire database, but are only concerned with querying, manipulating retrieving data for which they have access privileges.