Tutorial 07
01. Discuss the role of data in information systems indicating the need for data persistence
An information system (IS) is a set of components that work together to manage data processing and storage. Its role is to support the key aspects of running an organization, such as communication, record-keeping, decision making, data analysis and more. Companies use this information to improve their business operations, make strategic decisions and gain a competitive edge.
All information systems require the input of data in order to perform organizational activities. Data, as described by Stair and Reynolds (2006), is made up of raw facts such as employee information, wages, and hours worked, barcode numbers, tracking numbers or sale numbers. The scope of data collected depends on what information needs to be extrapolated for maximum efficiency.
02. Explain the terms: Data, Database, Database Server, and Database Management System
- What is Data?
In simple words, data are facts related to any object under consideration. For example, your name, age, height and weight are data related to you. A picture, image, file or PDF can also be considered data.
- What is a Database?
A database is a systematic collection of data. Databases support the storage and manipulation of data and make data management easy. Let's discuss a few examples.
An online telephone directory would use a database to store data about people, phone numbers and other contact details.
Your electricity service provider uses a database to manage billing, client-related issues, fault data, etc.
Consider Facebook as well. It needs to store, manipulate and present data related to members, their friends, member activities, messages, advertisements and a lot more.
Countless other examples of database use could be given.
- What does Database Server mean?
The term database server may refer to both hardware and software used to run a database, according to the context. As software, a database server is the back-end portion of a database application, following the traditional client-server model. This back-end portion is sometimes called the instance. It may also refer to the physical computer used to host the database. When mentioned in this context, the database server is typically a dedicated higher-end computer that hosts the database.
Note that the database server is independent of the database architecture. Relational databases, flat files, non-relational databases: all these architectures can be accommodated on database servers.
File System
A file is a collection of related records stored on a storage medium such as a hard disk or optical disc.
Let’s see some pros and cons involved in saving files in the file system.
Pros of the File System
- Performance can be better than with a database. If you store large files in a database, a simple query to retrieve the list of files or file names will also load the file data if you used Select * in your query, which may slow things down. In a file system, accessing a file is simple and lightweight.
- Saving and downloading files is much simpler in the file system than in a database, since a simple "Save As" function is enough. Downloading can be done by addressing a URL that points to the location of the saved file.
- Migrating the data is an easy process. You can just copy and paste the folder to your desired destination while ensuring that write permissions are provided to your destination.
- It's cost effective in most cases to expand your web server rather than pay for certain databases.
- It's easy to migrate to cloud storage (e.g. Amazon S3) or to a CDN in the future.
Cons of the File System
- Loosely packed. The file system offers no ACID (Atomicity, Consistency, Isolation, Durability) guarantees, so nothing keeps the files consistent with the records that refer to them. Consider a scenario in which your files are deleted from their location manually or by an attacker. You might not know whether a file still exists. Painful, right?
- Low security. Since files are saved in a folder to which you must grant write permissions, the setup is prone to security issues such as hacking. It's best to avoid saving files in the file system if you cannot afford to compromise on security.
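The trade-off described above can be sketched in Java. This is a minimal sketch using only `java.nio.file`; the `uploads` folder name, the file name and the class name `FileStoreSketch` are assumptions made for illustration. Saving is a single call, but nothing stops the file from disappearing later, so the application has to check for it itself.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class FileStoreSketch {

    // Save the bytes under the given directory and return the resulting path.
    static Path save(Path dir, String fileName, byte[] content) throws IOException {
        Files.createDirectories(dir);           // make sure the folder exists
        Path target = dir.resolve(fileName);
        return Files.write(target, content);    // creates or overwrites the file
    }

    // The file system gives no transactional guarantee, so the caller
    // has to verify that the file is still there before using it.
    static boolean isAvailable(Path file) {
        return Files.isRegularFile(file);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("uploads");   // hypothetical upload folder
        Path saved = save(dir, "report.txt", "hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(isAvailable(saved));   // prints true
        Files.delete(saved);                      // simulate a manual deletion
        System.out.println(isAvailable(saved));   // prints false
    }
}
```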
Database
A database is a collection of data organized in a manner that allows access, retrieval, and use of that data. Let's see some pros and cons involved in saving files in the database.
Pros of Database
- ACID consistency, which includes a rollback of an update that is complicated when files are stored outside the database.
- Files will be in sync with the database and cannot be orphaned, which gives you the upper hand in tracking transactions.
- Backups automatically include file binaries.
- It's more secure than saving in a file system.
Cons of Database
- You may have to convert the files to BLOBs in order to store them in the database.
- Database backups will be larger and heavier.
- Memory use is inefficient. RDBMSs are often RAM-driven, so all data has to go through RAM first. Think about what happens when an RDBMS has to find and sort data: it tracks each data page, even for the smallest amount of data read or written, and it has to track whether the page is in memory or on disk, whether it is indexed or physically sorted, and so on.
Structured data usually resides in relational databases (RDBMS). Fields store length-delineated data such as phone numbers, Social Security numbers, or ZIP codes. Even text strings of variable length, like names, are contained in records, making them simple to search. Data may be human- or machine-generated as long as it is created within an RDBMS structure. This format is eminently searchable, both with human-generated queries and via algorithms that use the type of data and the field names, such as alphabetical or numeric, currency or date.
Unstructured data is essentially everything else. Unstructured data has internal structure but is not structured via pre-defined data models or schemas. It may be textual or non-textual, and human- or machine-generated. It may also be stored within a non-relational (NoSQL) database.
05. Explain different types of databases, providing examples for their use
Relational Database
The relational database is the most common and widely used type of database. A relational database stores data in the form of tables.
Operational Database
An operational database, a type that is popular across many organizations, generally includes customer databases, inventory databases, and personal databases.
Data Warehouse
There are many organizations that need to keep all their important data for a long span of time. This is where the importance of the data warehouse comes into play.
Distributed Database
As the name suggests, distributed databases are meant for organizations that have several workplace locations and need a database at each location.
End-user Database
To meet the needs of the end-users of an organization, the end-user database is used.
Hierarchical Databases
In a hierarchical database management system (hierarchical DBMS) model, data is stored as nodes in parent-child relationships. Besides the actual data, records in a hierarchical database also contain information about their parent/child relationships.
Network Databases
Network database management systems (network DBMSs) use a network structure to create relationships between entities. Network databases are mainly used on large digital computers. They are similar to hierarchical databases, but whereas a node in a hierarchical database can have only one parent, a network node can have relationships with multiple entities. A network database looks more like a cobweb or an interconnected network of records.
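The difference between the hierarchical and the network model can be sketched with two small in-memory record types. This is only an illustration of the one-parent versus many-owners rule, not how any real DBMS stores records; the class and field names (`TreeRecord`, `NetworkRecord`, `owners`, `members`) are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

public class RecordModelsSketch {

    // Hierarchical model: every record has at most one parent.
    static class TreeRecord {
        final String name;
        TreeRecord parent;
        final List<TreeRecord> children = new ArrayList<>();

        TreeRecord(String name) { this.name = name; }

        void addChild(TreeRecord child) {
            if (child.parent != null) {
                child.parent.children.remove(child);   // a record can hang under only one parent
            }
            child.parent = this;
            children.add(child);
        }
    }

    // Network model: a record may be linked to several owning records.
    static class NetworkRecord {
        final String name;
        final List<NetworkRecord> owners = new ArrayList<>();
        final List<NetworkRecord> members = new ArrayList<>();

        NetworkRecord(String name) { this.name = name; }

        void addMember(NetworkRecord member) {
            member.owners.add(this);
            members.add(member);
        }
    }

    public static void main(String[] args) {
        TreeRecord company = new TreeRecord("Company");
        TreeRecord sales = new TreeRecord("Sales");
        company.addChild(sales);
        System.out.println(sales.parent.name);        // prints Company

        NetworkRecord dept = new NetworkRecord("Sales");
        NetworkRecord project = new NetworkRecord("Launch");
        NetworkRecord employee = new NetworkRecord("Ann");
        dept.addMember(employee);
        project.addMember(employee);
        System.out.println(employee.owners.size());   // prints 2
    }
}
```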
06. Compare and contrast data warehouse with Big data
Data warehousing has been a common term for the last 10-20 years, whereas big data has been a hot trend for the last 5-10 years. Both hold a lot of data, are used for reporting, and are managed by electronic storage devices. A common belief is therefore that big data will soon replace the older data warehousing. However, big data and data warehousing are not interchangeable, as they are used for entirely different purposes. The comparison that follows covers the main differences.
| Aspect | Data Warehouse | Big Data |
|---|---|---|
| Meaning | Mainly an architecture, not a technology. It extracts data from a variety of SQL-based data sources and helps generate analytic reports. By definition, a data warehouse is the data repository, produced by one process, that is used for analytic reports. | Mainly a technology, which rests on the volume, velocity, and variety of the data. Volume is the amount of data coming from different sources, velocity is the speed of data processing, and variety is the number of types of data. |
| Preferences | Organizations that want to make informed decisions prefer data warehousing, because this kind of report needs reliable, trustworthy data from the sources. | Organizations that need to compare large amounts of big data containing valuable information, in order to make better decisions and gain more profitability and more customers, prefer the big data approach. |
| Accepted data sources | Accepts one or more homogeneous or heterogeneous data sources. | Accepts any kind of source, including business transactions, social media, and sensor or machine-specific data. The data may or may not come from a DBMS product. |
| Accepted formats | Handles mainly structured data. | Accepts all types of formats: structured data, relational data, and unstructured data including text documents, email, video, audio, stock ticker data and financial transactions. |
| Subject oriented | A data warehouse is subject oriented because it provides information on a specific subject rather than on the organization's ongoing operations. It mainly focuses on analyzing or displaying data that helps decision making. | Big data is also subject oriented. The main difference is the source of data, as big data can accept and process data from all sources, including social media and sensor or machine-specific data. It also aims to provide exact analysis of data on a specific subject. |
| Distributed file system | Processing huge amounts of data in a data warehouse is time consuming, and sometimes it takes an entire day to complete. | This is one of the big utilities of big data. HDFS is mainly designed to load huge amounts of data into distributed systems using MapReduce programs. |
07. Explain how the application components communicate with files and databases
- File – file path, URL
- Using a file path or URL, the application can access a particular resource and add to or modify it.
- DB – connection string
- A connection must be established using a connection string before the application can use the database. Once the connection between the database and the application is established, the application can perform any operation on the data in the database.
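The two access styles above can be sketched side by side. This is a minimal sketch, assuming a MySQL-style JDBC URL and a hypothetical `school` database; the host, port, database name and class name `AccessSketch` are illustrative. Opening the connection needs a matching JDBC driver on the classpath and a reachable server, so `open` is shown but not called.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.List;

public class AccessSketch {

    // File access: the application only needs a path to locate the resource.
    static List<String> readLines(Path file) throws IOException {
        return Files.readAllLines(file);
    }

    // Database access: the connection string (a JDBC URL here) names
    // the database type, host, port and database.
    static String jdbcUrl(String host, int port, String database) {
        return "jdbc:mysql://" + host + ":" + port + "/" + database;
    }

    // Establish the connection before running any statement.
    static Connection open(String url, String user, String password) throws SQLException {
        return DriverManager.getConnection(url, user, password);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("students", ".txt");
        Files.write(file, List.of("1,Alice", "2,Bob"));
        System.out.println(readLines(file));
        System.out.println(jdbcUrl("localhost", 3306, "school"));
    }
}
```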
08. Differentiate the SQL statements, Prepared statements, and Callable statements
SQL Statements
Execute standard SQL statements from the application
Statement stmt = con.createStatement();
// The values are concatenated directly into the SQL text
stmt.executeUpdate("update STUDENT set NAME = '" + name + "' where ID = " + id);
Prepared statements
The query only needs to be parsed (or prepared) once, but can be executed multiple times with the same or different parameters.
PreparedStatement pstmt = con.prepareStatement("update STUDENT set NAME = ? where ID = ?");
pstmt.setString(1, "MyName");
pstmt.setInt(2, 111);
pstmt.executeUpdate();
Callable statements
Execute stored procedures
CallableStatement cstmt = con.prepareCall("{call anyProcedure(?, ?, ?)}");
// Bind the three parameters with the cstmt.setXxx(...) methods before executing
cstmt.execute();
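The practical difference between a plain statement and a prepared statement can be sketched as follows. This is a minimal sketch built on the STUDENT example above; the class name `StatementSketch` and the method names are assumptions. `concatenatedSql` only builds the SQL text, so it runs without a database, while `renameAll` needs a live `Connection` and is therefore not called in `main`.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Map;

public class StatementSketch {

    // Plain Statement style: the values become part of the SQL text itself.
    static String concatenatedSql(String name, int id) {
        return "update STUDENT set NAME = '" + name + "' where ID = " + id;
    }

    // PreparedStatement style: the SQL text is fixed and prepared once;
    // only the bound parameters change between executions.
    static int[] renameAll(Connection con, Map<Integer, String> namesById) throws SQLException {
        String sql = "update STUDENT set NAME = ? where ID = ?";
        try (PreparedStatement pstmt = con.prepareStatement(sql)) {
            for (Map.Entry<Integer, String> e : namesById.entrySet()) {
                pstmt.setString(1, e.getValue());
                pstmt.setInt(2, e.getKey());
                pstmt.addBatch();          // queue this parameter set
            }
            return pstmt.executeBatch();   // run all queued updates
        }
    }

    public static void main(String[] args) {
        // A crafted name changes the meaning of the concatenated statement.
        System.out.println(concatenatedSql("x' where 1=1 --", 111));
        // prints: update STUDENT set NAME = 'x' where 1=1 --' where ID = 111
    }
}
```

Because bound parameters are never spliced into the SQL text, the same crafted name passed through `setString` would simply be stored as a name.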
09. Argue the need for ORM, explaining the development with and without ORM
Object-relational mapping (ORM) is a mechanism that makes it possible to address, access and manipulate objects without having to consider how those objects relate to their data sources.
PROS
- Facilitates implementing domain model pattern.
- Huge reduction in code.
- Takes care of vendor specific code by itself.
- Cache Management — Entities are cached in memory thereby reducing load on the DB.
CONS
- Increased startup time due to metadata preparation (not good for desktop applications).
- Steep learning curve.
- Relatively hard to fine-tune and debug the generated SQL.
- Not suitable for applications without a clean domain object model.
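Development with and without an ORM can be contrasted in a few lines. This is a deliberately reduced sketch and not how any ORM library is implemented: a `Map` stands in for a database row, the `Student` class is hypothetical, and field names are assumed to match column names. `mapByHand` is the code written per entity without an ORM; `mapGeneric` shows the kind of mapping an ORM derives automatically.

```java
import java.lang.reflect.Field;
import java.util.Map;

public class OrmSketch {

    // A plain domain object; field names are assumed to match column names.
    public static class Student {
        int id;
        String name;
    }

    // Without an ORM: one hand-written mapping method per entity.
    static Student mapByHand(Map<String, Object> row) {
        Student s = new Student();
        s.id = (Integer) row.get("id");
        s.name = (String) row.get("name");
        return s;
    }

    // The idea an ORM automates, reduced to its core: copy each column
    // into the field of the same name using reflection.
    static <T> T mapGeneric(Map<String, Object> row, Class<T> type) throws ReflectiveOperationException {
        T target = type.getDeclaredConstructor().newInstance();
        for (Field field : type.getDeclaredFields()) {
            if (row.containsKey(field.getName())) {
                field.setAccessible(true);
                field.set(target, row.get(field.getName()));
            }
        }
        return target;
    }

    public static void main(String[] args) throws ReflectiveOperationException {
        Map<String, Object> row = Map.of("id", 111, "name", "MyName");
        Student a = mapByHand(row);
        Student b = mapGeneric(row, Student.class);
        System.out.println(a.name + " " + b.name);   // prints MyName MyName
    }
}
```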
JPA
- it is EJB 3.0-compliant;
- it is light-weight;
- it manages persistent data in concert with a JPA entity manager;
- it performs complex business logic;
- it potentially uses several dependent Java objects;
- it can be uniquely identified by a primary key.
POJO
- It doesn’t have special restrictions other than those forced by Java language.
- It doesn’t provide much control on members.
- It can implement Serializable interface.
- Fields can be accessed by their names.
- Fields can have any visibility.
- There can be a no-arg constructor.
- It is used when you don't want to restrict access to the members and want to give the user complete access to your entity.
JAVA BEAN
- It is a special POJO with some restrictions.
- It provides complete control over its members.
- It should implement the Serializable interface.
- Fields are accessed only through getters and setters.
- Fields have only private visibility.
- It must have a no-arg constructor.
- It is used when you want to give the user access to only some parts of your entity.
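The two lists above translate into code roughly as follows. This is a minimal sketch with hypothetical `StudentPojo` and `StudentBean` classes; the blank-name check in the setter is an illustrative assumption showing where a bean can control its members.

```java
import java.io.Serializable;

public class BeanSketch {

    // POJO: no rules beyond the Java language; fields may be accessed directly.
    public static class StudentPojo {
        public int id;
        public String name;
    }

    // JavaBean: private fields, public no-arg constructor,
    // getters and setters, and Serializable.
    public static class StudentBean implements Serializable {
        private static final long serialVersionUID = 1L;
        private int id;
        private String name;

        public StudentBean() { }

        public int getId() { return id; }
        public void setId(int id) { this.id = id; }
        public String getName() { return name; }

        // The setter is where the bean controls its members,
        // here by rejecting blank names.
        public void setName(String name) {
            if (name == null || name.trim().isEmpty()) {
                throw new IllegalArgumentException("name must not be blank");
            }
            this.name = name;
        }
    }

    public static void main(String[] args) {
        StudentPojo pojo = new StudentPojo();
        pojo.name = "";                       // nothing prevents an invalid value
        StudentBean bean = new StudentBean();
        bean.setName("MyName");               // goes through the validating setter
        System.out.println(bean.getName());   // prints MyName
    }
}
```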
ORM tools and frameworks by language:
- PHP :- CakePHP, CodeIgniter, Doctrine, FuelPHP
- Python :- Django, SQLAlchemy, SQLObject, Storm
- C++ :- ODB, QxOrm
- Java :- ActiveJDBC, ActiveJPA, Apache Cayenne, Apache Gora, Athena Framework, Carbonado
- .NET :- Base One Foundation Component Library, DatabaseObjects, DataObjects.NET, Dapper, ECO, Entity Framework
12. Discuss the need for NoSQL indicating the benefits, also explain different types of NoSQL databases
Benefits
- Schemaless data representation
- Development time
- Speed
- Plan ahead for scalability
NoSQL Database Management Systems
- MongoDB
- Redis
- CouchDB
- RavenDB
- MemcacheDB
- Riak
- Neo4j
13. Discuss what Hadoop is, explaining the core concepts of it
Hadoop is an open source framework for the distributed storage and processing of large data sets. It is based on the concepts of the Google File System and MapReduce.
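The MapReduce idea that Hadoop builds on can be illustrated with a word count in plain Java. This sketch does not use the Hadoop API and runs in a single process; the class and method names are assumptions. In Hadoop, the map calls would run in parallel on the nodes that hold the data, and the framework would group the pairs by key before the reduce step.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MapReduceSketch {

    // Map phase: emit a (word, 1) pair for every word in one input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle and reduce phase: group the pairs by key and sum the values.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> all = new ArrayList<>();
        for (String line : lines) {    // each line stands in for one input split
            all.addAll(map(line));
        }
        return reduce(all);
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("big data", "Big warehouse")));
        // prints {big=2, data=1, warehouse=1}
    }
}
```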
14. Explain the concept of IR, identifying tools for IR
Information retrieval, as the name implies, concerns the retrieving of relevant information from databases. It is basically concerned with facilitating the user's access to large amounts of (predominantly textual) information. The process of information retrieval involves the following stages:
- Representing Collections of Documents - how to represent, identify and process the collection of documents.
- User-initiated querying - understanding and processing of the queries.
- Retrieval of the appropriate documents - the searching mechanism used to obtain and retrieve the relevant documents
Tools for IR:
- Apache Solr
- Elasticsearch
- Algolia
- Sphinx (search engine)
- Site Search 360
- OpenSearchServer
- Xapian
- Manticore search
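Independently of any particular tool, the three retrieval stages described above can be sketched with a tiny in-memory inverted index. This is a minimal illustration and not how any of the listed search engines is implemented; the class name, the tokenizer and the AND-only query semantics are assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class InvertedIndexSketch {

    // term -> ids of the documents that contain it
    private final Map<String, Set<Integer>> index = new HashMap<>();

    // Stage 1: represent the collection by indexing every term of each document.
    void add(int docId, String text) {
        for (String term : tokenize(text)) {
            index.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
    }

    // Stages 2 and 3: process the query terms, then return the documents
    // that contain all of them (a simple AND query).
    Set<Integer> search(String query) {
        Set<Integer> result = null;
        for (String term : tokenize(query)) {
            Set<Integer> docs = index.getOrDefault(term, new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        if (result == null) {
            return new TreeSet<>();    // empty query matches nothing
        }
        return result;
    }

    // Lower-case the text and split it on non-word characters.
    private static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase().split("\\W+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        InvertedIndexSketch ir = new InvertedIndexSketch();
        ir.add(1, "Hadoop stores big data");
        ir.add(2, "A data warehouse stores structured data");
        System.out.println(ir.search("stores data"));   // prints [1, 2]
        System.out.println(ir.search("big data"));      // prints [1]
    }
}
```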
Q1
lecture 08
Q2
https://www.toolsqa.com/sql/data-database-and-database-management-system/
Q3
https://dzone.com/articles/which-is-better-saving-files-in-database-or-in-fil
Q5
https://www.quora.com/What-are-the-different-types-of-databases
Q6
https://www.educba.com/big-data-vs-data-warehouse/
Q7
https://www.oreilly.com/library/view/web-database-applications/0596005431/ch01.html
Q8
https://javaconceptoftheday.com/statement-vs-preparedstatement-vs-callablestatement-in-java/
Q9
https://medium.com/building-the-system/dont-be-a-sucker-and-stop-using-orms-190add65add4