HDFS (Hadoop Distributed File System)

HDFS (Hadoop Distributed File System) is a file system designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.

HDFS is a filesystem written in Java
--Based on Google’s GFS
Sits on top of a native filesystem
--ext3, xfs etc
Provides redundant storage for massive amounts of data
--Using cheap, unreliable computers
HDFS performs best with a ‘modest’ number of large files
--Millions, rather than billions, of files
--Each file is typically 100 MB or more
Files in HDFS are ‘write once’
--No random writes to files are allowed
HDFS is optimized for large, streaming reads of files
--Rather than random reads


How Files Are Stored
Files are split into blocks.
Data is distributed across many machines at load time
--Different blocks from the same file will be stored on different machines
--This provides for efficient MapReduce processing
Blocks are replicated across multiple machines, known as DataNodes
--Default replication is three-fold
  – i.e., each block exists on three different machines
A master node called the NameNode keeps track of which blocks make up a file, and where those blocks are located
--Known as the metadata
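
The storage scheme above can be sketched in a few lines of Python. This is a toy model, not HDFS code: the 4-byte block size, the DataNode names, and the round-robin placement are assumptions chosen for readability (real HDFS uses far larger blocks and a rack-aware placement policy).

```python
BLOCK_SIZE = 4     # bytes; tiny on purpose (assumption, real blocks are far larger)
REPLICATION = 3    # the three-fold default described above


def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Cut a file's bytes into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Assign each block to `replication` distinct DataNodes, round-robin.

    Hypothetical placement rule; it needs len(datanodes) >= replication.
    """
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            datanodes[(block_id + r) % len(datanodes)] for r in range(replication)
        ]
    return placement


datanodes = ["dn1", "dn2", "dn3", "dn4"]
blocks = split_into_blocks(b"hello hdfs!!")          # 12 bytes -> 3 blocks
metadata = place_replicas(len(blocks), datanodes)    # what a NameNode would track
print(metadata)
```

The `metadata` dictionary plays the role of the NameNode's block map: it records where each block lives, while the block contents themselves would sit on the DataNodes.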

How Files Are Stored: Example

NameNode holds metadata for the data files.
DataNodes hold the actual blocks
--Each block is replicated three times on the cluster



HDFS: Points to Note
When a client application wants to read a file:
--It communicates with the NameNode to determine which blocks make up the file, and which DataNodes those blocks reside on
--It then communicates directly with the DataNodes to read the data
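
The two-step read path can be illustrated with in-memory stand-ins. Everything here is hypothetical (the path, block IDs, and dictionaries are invented for the sketch and are not Hadoop's client API); the point is only that the NameNode supplies locations and the data itself comes from DataNodes.

```python
# Toy NameNode: file path -> ordered list of (block id, DataNodes holding it).
namenode = {
    "/logs/a.txt": [("blk_1", ["dn1", "dn2"]), ("blk_2", ["dn2", "dn3"])],
}

# Toy DataNodes: node name -> {block id: block bytes}.
datanodes = {
    "dn1": {"blk_1": b"hello "},
    "dn2": {"blk_1": b"hello ", "blk_2": b"world"},
    "dn3": {"blk_2": b"world"},
}


def read_file(path):
    """Ask the NameNode for block locations, then fetch each block from a DataNode."""
    data = b""
    for block_id, locations in namenode[path]:          # step 1: metadata lookup
        for node in locations:                          # step 2: direct block read
            block = datanodes.get(node, {}).get(block_id)
            if block is not None:
                data += block
                break
        else:
            raise IOError("no live replica for " + block_id)
    return data


print(read_file("/logs/a.txt"))
```

Because every block has more than one location, removing `dn1` from `datanodes` still lets the read succeed from the remaining replica.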


Big Data Analysis With HDFS

HDFS Concepts: Blocks, Replicas, NameNode, DataNode
NameNode manages the file system namespace

HDFS ARCHITECTURE


Command line interface

HDFS File Read

HDFS File Write


Start-up process
-NameNode enters Safemode
  --Replication does not occur in Safemode
-Each DataNode sends a Heartbeat
-Each DataNode sends a Blockreport
  --Lists all HDFS data blocks on that DataNode
-NameNode creates the Blockmap from the Blockreports
-NameNode exits Safemode
-NameNode replicates any under-replicated blocks
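
As a rough illustration of the Blockreport step, the sketch below builds a block map from per-DataNode reports and lists the blocks that would need re-replication after Safemode. The report contents and function names are made up for the example.

```python
from collections import defaultdict

REPLICATION = 3  # target replica count, matching the default described earlier


def build_block_map(block_reports):
    """Invert {datanode: [block ids]} into {block id: set of datanodes}."""
    block_map = defaultdict(set)
    for datanode, block_ids in block_reports.items():
        for block_id in block_ids:
            block_map[block_id].add(datanode)
    return dict(block_map)


def under_replicated(block_map, replication=REPLICATION):
    """Blocks with fewer live replicas than the target, in sorted order."""
    return sorted(b for b, nodes in block_map.items() if len(nodes) < replication)


# Hypothetical Blockreports received during start-up.
reports = {
    "dn1": ["blk_1", "blk_2"],
    "dn2": ["blk_1", "blk_2"],
    "dn3": ["blk_1"],
}
block_map = build_block_map(reports)
print(under_replicated(block_map))
```

Here `blk_2` is reported by only two DataNodes, so it is the one block queued for replication.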
Checkpoint process
-Performed by the NameNode
-Two versions of FsImage
   --One stored on disk
   --One in memory
-Applies all transactions in EditLog to in-memory FsImage
-Flushes FsImage to disk
-Truncates EditLog
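
The checkpoint steps can be mimicked with plain Python data structures. This is a sketch under simplifying assumptions: the FsImage is modelled as a dictionary, the EditLog as a list of `(operation, path, value)` tuples, and only two invented operations are supported.

```python
def checkpoint(fsimage_on_disk, edit_log):
    """Replay the edit log onto an in-memory image, then 'flush' and truncate."""
    fsimage = dict(fsimage_on_disk)          # in-memory copy of the on-disk image
    for operation, path, value in edit_log:  # apply every logged transaction
        if operation == "create":
            fsimage[path] = value
        elif operation == "delete":
            fsimage.pop(path, None)
    new_image_on_disk = dict(fsimage)        # stands in for flushing to disk
    edit_log.clear()                         # truncate the log
    return new_image_on_disk


disk_image = {"/a": ["blk_1"]}
log = [("create", "/b", ["blk_2"]), ("delete", "/a", None)]
disk_image = checkpoint(disk_image, log)
print(disk_image, log)
```

After the call the new image reflects both transactions and the log is empty, so a later restart has nothing left to replay.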
Namenode memory concern
For fast access, the NameNode keeps all block metadata in memory
--The bigger the cluster, the more RAM required
--Best for millions of large files (100 MB or more) rather than billions
--Works well for clusters of hundreds of machines
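
A back-of-the-envelope calculation shows why file count matters. The figure of roughly 150 bytes of NameNode heap per file or block object is a commonly quoted rule of thumb and is an assumption here, not a number from these notes; one block per file is also assumed.

```python
BYTES_PER_OBJECT = 150  # assumed rule of thumb per file or block object


def namenode_heap_bytes(num_files, blocks_per_file=1):
    """Estimate heap used by file and block objects (one file + its blocks)."""
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT


millions = namenode_heap_bytes(10**6)   # one million files
billions = namenode_heap_bytes(10**9)   # one billion files
print(millions / 10**6, "MB vs", billions / 10**9, "GB")
```

Under these assumptions a million files cost a few hundred megabytes of heap, while a billion files would need hundreds of gigabytes on a single machine.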
Hadoop 2+
--NameNode Federation
--Each NameNode hosts part of the namespace and its blocks
--Scales the NameNode horizontally
--Support for 1000+ machine clusters
--Yahoo! runs 50,000+ machines
For more detail visit Apache Hadoop
NameNode fault tolerance
The NameNode daemon process must be running at all times
--If the process crashes, the cluster is down
The NameNode is a single point of failure
--Host it on a machine with reliable hardware (e.g., one that can survive a disk failure)
--Usually not an issue
Hadoop 2+
--High Availability NameNode
--A standby NameNode is always running and takes over if the active NameNode fails
--Still in its infancy
Source:HDFS

Internet of Things (IoT)

Content
  1. Introduction to IoT
  2. Evolution of IoT
  3. Why IoT?
  4. General Requirements 
  5. Communication Features 
  6. Technologies Involved
  7. Applications
What Is the Internet of Things?
--The Internet of Things (IoT) is a computing concept that interconnects uniquely identifiable devices.
--By integrating technologies such as actuator and sensor networks, identification and tracking technology, enhanced communication protocols, and the distributed intelligence of smart objects, IoT enables real-time communication between the objects around us.
--From anytime, anyplace connectivity for anyone, we will now have connectivity for anything!
IoT Structure

History
--1997: ITU launches its Internet Reports series under the title “Challenges to the Network”; “The Internet of Things” is the seventh report in that series.
--1999: Auto-ID Center founded at MIT
--2003: EPCglobal founded at MIT
--2005: Important technologies of the Internet of Things were proposed at the WSIS conference.
--2008: The first international conference on the Internet of Things, IOT 2008, was held in Zurich.

Cisco’s Forecast for IoT

--In 2008, the number of things connected to the Internet exceeded the number of people living on Earth.
--By 2020, the number of things connected to the Internet was forecast to reach about 50 billion.

Evolution of Internet of Things



Evolution of Internet of Things: Gartner Report

Why Internet of Things?
--Dynamic control of industry and daily life
--Improve the resource utilization ratio
--Better relationship between humans and nature
--Forming an intellectual entity by integrating human society and physical systems
--Flexible configuration, P&P…
--Universal transport & internetworking
--Accessibility & Usability?
--Acts as a technology integrator

Visions of Internet of Things
IoT General Requirements

IoT Communication Features

Technologies Involved

--Communication
--Backbone
--Hardware
--Protocols
--Software
--Data Brokers/Cloud
--Platforms
--Machine Learning

Communication
RFID
-A radio-frequency identification (RFID) system uses tags, or labels, attached to the objects to be identified. Two-way radio transmitter-receivers called interrogators or readers send a signal to the tag and read its response.
-RFID tags can be passive, active, or battery-assisted passive.
-Frequency: 120–150 kHz (LF), 13.56 MHz (HF), 433 MHz (UHF)
-Range: 10cm to 200m





EnOcean
-ISO/IEC 14543-3-10 (Alliance)
-EnOcean is an energy-harvesting wireless technology used primarily in building automation systems, but it is also applied in industry, transportation, logistics and smart homes
-Frequency: 315 MHz, 868 MHz, 902 MHz
-Range: 300m outdoors, 30m indoors

NFC
-ISO/IEC 18092 and ISO/IEC 14443-2,3,4
-NFC is a set of short-range wireless technologies, typically requiring a distance of 10 cm or less.
-NFC always involves an initiator and a target; the initiator actively generates an RF field that can power a passive target.
-Frequency: 13.56 MHz
-Range: < 0.2 m
Bluetooth
-Bluetooth is a wireless technology standard for exchanging data over short distances, using short-wavelength radio transmissions in the ISM band.
-Frequency: 2.4GHz
-Range: 1-100m
WiFi (Alliance)
-The Wi-Fi Alliance defines Wi-Fi as any "wireless local area network (WLAN) products that are based on the Institute of Electrical and Electronics Engineers' (IEEE) 802.11 standards".
-Frequency: 2.4 GHz, 3.6 GHz and 4.9/5.0 GHz
-Range: Common range is up to 100m but can be extended.
Weightless (SIG)
-Weightless is a proposed proprietary open wireless technology standard for exchanging data between a base station and thousands of machines around it, using white space spectrum with high levels of security.
-Frequency: Varies with legislation (470–790 MHz)
-Range: Up to 10km
-Data Rates: 1kbits/s to 10Mbits/s

GSM (Association)
-GSM (Global System for Mobile communications) is an open, digital cellular technology used for transmitting mobile voice and data services.
-Frequency: Europe: 900 MHz & 1.8 GHz; US: 850 MHz & 1.9 GHz
-Data Rates: 9.6 kbps

Additional: 
3G  
4G LTE 
Dash7 
Ethernet 
GPRS 
PLC / Powerline 
QR Codes
EPC 
WiMax 
X-10 
802.15.4 
Z-Wave 
Zigbee 

Backbone
IPv6
--Internet Protocol version 6 (IPv6) is the latest revision of the Internet Protocol (IP), the communications protocol that provides an identification and location system for computers on networks and routes traffic across the Internet.
--IPv6 uses a 128-bit address, allowing 2^128, or approximately 3.4×10^38 addresses, or more than 7.9×10^28 times as many as IPv4, which uses 32-bit addresses.
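
The address-space figures can be checked directly with integer arithmetic. The variable names are illustrative only.

```python
ipv6_addresses = 2 ** 128   # 128-bit addresses
ipv4_addresses = 2 ** 32    # 32-bit addresses
ratio = ipv6_addresses // ipv4_addresses   # equals 2 ** 96

print(f"IPv6 addresses: {ipv6_addresses:.1e}")   # about 3.4e+38
print(f"IPv6 / IPv4:    {ratio:.1e}")            # about 7.9e+28
```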
UDP and TCP
--With UDP, computer applications can send messages, in this case referred to as datagrams, to other hosts on an Internet Protocol (IP) network without prior communications to set up special transmission channels or data paths.
--The Transmission Control Protocol (TCP) is intended for use as a highly reliable host-to-host protocol between hosts in packet-switched computer communication networks.
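
A minimal loopback sketch of the UDP behaviour described above, using Python's standard `socket` module. The payload is a made-up sensor reading, and both endpoints run in one process purely for illustration. Note that the sender never calls `connect` or performs a handshake; a TCP socket would have to establish a connection first.

```python
import socket

# Receiver: bind to the loopback interface; port 0 lets the OS pick a free port.
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))
receiver.settimeout(2.0)            # avoid blocking forever if the datagram is lost
address = receiver.getsockname()

# Sender: a single datagram goes out with no prior connection setup.
sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.sendto(b"temperature=21.5", address)

payload, _ = receiver.recvfrom(1024)
print(payload)

sender.close()
receiver.close()
```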
6LoWPAN
--6LoWPAN is an acronym of IPv6 over Low power Wireless Personal Area Networks. 
--The 6LoWPAN group has defined encapsulation and header compression mechanisms that allow IPv6 packets to be sent and received over IEEE 802.15.4 based networks.
--It has to cope with issues such as small packet sizes, low bandwidth, low power, large numbers of devices, unreliable radio connectivity, battery drain, device lockups, and physical tampering.

Hardware
-Wireless SoC (system on chip)
-Self-contained, RF-certified module solutions that have TCP, UDP and IP on chip.
-Manufacturers: Gainspan, Wiznet, Nordic Semiconductor, TI
-Prototyping boards and platforms
--Arduino
--Raspberry Pi
--BeagleBone Black
-These communities and prototyping platforms make it possible to create your own Internet of Things project.
Sensors
-Sensors measure physical parameters such as the presence of certain biological entities (biosensors), wavelengths of light (image sensors), and flow velocity (thermal flow sensors).

Software
RIOT OS
-RIOT OS is an operating system for Internet of Things (IoT) devices. It is based on a microkernel and designed for energy efficiency, hardware-independent development, and a high degree of modularity.

Thingsquare Mist
-Thingsquare Mist is open source firmware that is exceptionally lightweight, battle-proven, and works with multiple microcontrollers and a range of radios.


Protocols
CoAP
-Constrained Application Protocol (CoAP) is an application layer protocol that is intended for use in resource-constrained internet devices, such as WSN nodes.

MQTT
-MQ Telemetry Transport (MQTT) is an open message protocol for M2M communications that enables the transfer of telemetry-style data in the form of messages from pervasive devices, over high-latency or constrained networks, to a server or small message broker.

XMPP
-The Extensible Messaging and Presence Protocol (XMPP) is an open technology for real-time communication.
-It powers a wide range of applications including instant messaging, presence, multi-party chat, voice and video calls, collaboration, lightweight middleware, content syndication, and generalized routing of XML data.

RESTful HTTP
-Representational State Transfer (REST) is a style of software architecture for distributed systems such as the World Wide Web. REST has emerged as a predominant web API design model.

Data Brokers/Cloud Services
ThingWorx
-It provides a complete application design, runtime, and intelligence environment - allowing organizations to rapidly create M2M applications
EVRYTHNG
-The EVRYTHNG Engine provides high scale, industrial technology to create and serve millions of Active Digital Identities™ for a company’s products and other objects. These unique online profiles create a persistent, unique digital presence for any physical object on the Web.   
Sense
-Open.Sen.se is an open platform for all those who want to imagine, prototype and test new devices, installations, scenarios and applications for this globally interconnected and immersive world.
Grok Engine
-Grok is software that breaks this bottleneck with three unique capabilities: a high level of automation in analyzing streaming data, the ability to learn continuously from data, and the ability to drive action from the output of Grok's data models.

Characteristics of Most Relevant Standardization Activities

Middleware Architecture of IoT


SOA based architecture for IoT middleware

Technology Roadmap of Internet of Things

Applications of IoT
Management
Retail
Food
Education
Pharmaceuticals
Security
Transport and Logistics
Smart Cities
Smart Manufacturing
Daily life and domotics
Management
-Data Management
-Waste Management
-Urban Planning
-Production Management

Retail
-Intelligent Shopping
-Bar Code in Retail
-Electronic Tags

Pharmaceuticals
-Intelligent tags for drugs
-Drug usage tracking
-Enable emergency treatment to be given faster and more accurately

Food
-Control geographical origin
-Food production management
-Nutrition calculations
-Prevent overproduction and shortage
-Control food quality, health and safety.
Education
-School Administration
-Attendance Management
-Voting System
-Automatic Feedback 
-Instructional Technology
-Media 
-Information management
-Foreign language learning

AUTOBOT
-Diagnostics service for cars
-Alerts relatives in case of an accident
-Discovery service of car position
-Integrated with several web services
Transportation
-ConLock
-ContainerSafe
-Integration of light sensors, GPS and GSM
Smart Cities
-Residential E-meters
-Smart street lights
-Pipeline leak detection
-Traffic control
-Surveillance cameras
-Centralized and integrated system control
Smart Manufacturing
-Flow optimization
-Real time inventory 
-Asset tracking
-Employee safety
-Predictive maintenance
-Firmware updates

Daily Life and Domotics
Source:IOT

MapReduce (Mapper, Reducer)

What is MapReduce
MapReduce is a method for distributing a task across multiple nodes
Each node processes data stored on that node
--Where possible
Consists of two phases:
--Map
--Reduce

Features of MapReduce

Automatic parallelization and distribution
Fault-tolerance
Status and monitoring tools
A clean abstraction for programmers
--MapReduce programs are usually written in Java
MapReduce abstracts all the ‘housekeeping’ away from the developer 
--Developer can concentrate simply on writing the Map and Reduce functions

MapReduce: The Mapper
Hadoop attempts to ensure that Mappers run on nodes which hold their portion of the data locally, to avoid network traffic
--Multiple Mappers run in parallel, each processing a portion of the input data
The Mapper reads data in the form of key/value pairs
It outputs zero or more key/value pairs:

map(in_key, in_value)->(inter_key, inter_value) list

The Mapper may use or completely ignore the input key
--For example, a standard pattern is to read a line of a file at a time
--The key is the byte offset into the file at which the line starts
--The value is the contents of the line itself
--Typically the key is considered irrelevant
If the Mapper writes anything out at all, the output must be in the form of key/value pairs

MapReduce Example: Word Count
Count the number of occurrences of each word in a large amount of input data:
map(input_key, input_value)
  foreach word w in input_value:
    emit(w, 1)
Input to the mapper:
(3414, 'the cat sat on the mat')
(3437, 'the aardvark sat on the sofa')
Output from the mapper:
('the', 1), ('cat', 1), ('sat', 1), ('on', 1),
('the', 1), ('mat', 1), ('the', 1), ('aardvark', 1),
('sat', 1), ('on', 1), ('the', 1), ('sofa', 1)
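
The word-count mapper can be written as a runnable Python generator. This is a plain-Python sketch of the idea, not Hadoop's Java Mapper API; the function name and the driver loop are assumptions for illustration.

```python
def map_word_count(input_key, input_value):
    """Emit (word, 1) for every word in the line; the byte-offset key is ignored."""
    for word in input_value.split():
        yield (word, 1)


# The same two input records as in the example above.
records = [
    (3414, "the cat sat on the mat"),
    (3437, "the aardvark sat on the sofa"),
]

mapper_output = [
    pair for key, value in records for pair in map_word_count(key, value)
]
print(mapper_output)
```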

MapReduce: The Reducer
After the Map phase is over, all the intermediate values for a given intermediate key are combined together into a list
This list is given to a Reducer
--There may be a single Reducer, or multiple Reducers
--All values associated with a particular intermediate key are guaranteed to go to the same Reducer
--The intermediate keys, and their value lists, are passed to the Reducer in sorted key order
--This step is known as the ‘shuffle and sort’
The Reducer outputs zero or more final key/value pairs
--These are written to HDFS
--In practice, the Reducer usually emits a single key/value pair for each input key
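
A small sketch of what 'shuffle and sort' does to mapper output, in plain Python. The grouping function and the CRC-based partitioner are illustrative stand-ins (assumptions), not Hadoop's implementation; they only demonstrate that values are grouped per key, keys arrive sorted, and a given key always maps to the same Reducer.

```python
import zlib
from collections import defaultdict


def shuffle_and_sort(pairs):
    """Group intermediate values by key and return the groups in sorted key order."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def partition_for(key, num_reducers):
    """Pick a Reducer index deterministically, so one key always lands on one Reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


pairs = [("the", 1), ("cat", 1), ("sat", 1), ("the", 1)]
grouped = shuffle_and_sort(pairs)
print(grouped)
print({key: partition_for(key, 3) for key, _ in grouped})
```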

Example Reducer: Sum Reducer
Add up all the values associated with each intermediate key:
reduce(output_key, intermediate_vals)
  count = 0
  foreach v in intermediate_vals:
    count += v
  emit(output_key, count)
Reducer output:
('aardvark', 1)
('cat', 1)
('mat', 1)
('on', 2)
('sat', 2)
('sofa', 1)
('the', 4)
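
The sum reducer is equally short as runnable Python. As with the mapper, this is a sketch in plain Python rather than Hadoop's Reducer API, and the grouped input is written out by hand to match the two example sentences used earlier.

```python
def reduce_sum(output_key, intermediate_vals):
    """Emit one (key, total) pair for each intermediate key."""
    yield (output_key, sum(intermediate_vals))


# Grouped, key-sorted input as the Reducer would receive it.
grouped = [
    ("aardvark", [1]), ("cat", [1]), ("mat", [1]), ("on", [1, 1]),
    ("sat", [1, 1]), ("sofa", [1]), ("the", [1, 1, 1, 1]),
]

reducer_output = [
    pair for key, values in grouped for pair in reduce_sum(key, values)
]
print(reducer_output)
```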




Big Data Analytics with MapReduce

Map-Reduce Example






Parallel using multiple machines

Applications
Machine Learning
--Apache Mahout
--Scalable machine learning library on top of Hadoop.
Scientific calculations
--Apache Hama
--Bulk Synchronous Parallel framework on top of HDFS for massive scientific calculations such as matrix, graph and network algorithms.
Graph processing
--Apache Giraph
--An iterative graph processing system built for high scalability.
Image Processing
--HIPI: Hadoop Image Processing Interface.
Bioinformatics
--BLAST, SOM

Hadoop Terminology
Job
--A full program: an execution of a Mapper and Reducer across a data set
Task
--An execution of a Mapper or Reducer on a slice of data
Task Attempt
--A particular instance of an attempt to execute a task on a machine

How Hadoop Runs a MapReduce Job

MR Flow: Key Value pairs


Hadoop MR data flow

Failure
Task Failure
Task tracker Failure
Job tracker Failure

Features
Combiner
Speculative Execution
Job Scheduling
--Fair Scheduler
--Capacity Scheduler
Counter
Distributed Cache

Key and Value Types
Utilizes Hadoop’s serialization mechanism for writing data in and out of the network, databases, or files
--Optimized for network serialization
--A set of basic types is provided
--Easy to implement your own
Values must implement the Writable interface
--The framework’s serialization mechanism
--Defines how to read and write fields
--Found in the org.apache.hadoop.io package
Keys must implement the WritableComparable interface
--Extends Writable and java.lang.Comparable<T>
--Required because keys are sorted prior to the reduce phase
Hadoop ships with many default implementations of WritableComparable<T>
--Wrappers for primitives (String, Integer, etc.)
--Or you can implement your own

WritableComparable<T> Implementations

Implement Custom WritableComparable<T>
Implement 3 methods
write(DataOutput)
--Serialize your attributes
readFields(DataInput)
--De-Serialize your attributes
compareTo(T)
--Identify how to order your objects
If your custom object is used as the key, it will be sorted prior to the reduce phase
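
The three methods can be illustrated with a Python analogue. Hadoop's real interface is Java (`write(DataOutput)`, `readFields(DataInput)`, `compareTo(T)`); the class below only mirrors that shape, and the `WordYearKey` type, its fields, and its byte layout are all hypothetical.

```python
import io
import struct
from functools import total_ordering


@total_ordering
class WordYearKey:
    """Hypothetical composite key (word, year) mirroring the three methods above."""

    def __init__(self, word="", year=0):
        self.word = word
        self.year = year

    def write(self, out):
        """Serialize: length-prefixed UTF-8 word, then a 4-byte big-endian year."""
        encoded = self.word.encode("utf-8")
        out.write(struct.pack(">I", len(encoded)))
        out.write(encoded)
        out.write(struct.pack(">i", self.year))

    def read_fields(self, inp):
        """Deserialize the fields in exactly the order write() produced them."""
        (length,) = struct.unpack(">I", inp.read(4))
        self.word = inp.read(length).decode("utf-8")
        (self.year,) = struct.unpack(">i", inp.read(4))

    def compare_to(self, other):
        """Order by word, then year; returns -1, 0, or 1."""
        mine, theirs = (self.word, self.year), (other.word, other.year)
        return (mine > theirs) - (mine < theirs)

    def __eq__(self, other):
        return self.compare_to(other) == 0

    def __lt__(self, other):
        return self.compare_to(other) < 0


# Round-trip one key through a byte buffer, then sort a few keys.
buffer = io.BytesIO()
WordYearKey("cat", 2020).write(buffer)
buffer.seek(0)
copy = WordYearKey()
copy.read_fields(buffer)

keys = sorted([WordYearKey("the", 2001), WordYearKey("cat", 2020), WordYearKey("cat", 2019)])
print([(k.word, k.year) for k in keys])
```

The important design point carries over to Java: `readFields` must consume fields in the same order and format that `write` emits them, and the comparison method defines the order in which keys reach the Reducer.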

Framework’s Usage of InputFormat Implementation

InputFormat


OutputFormat


MR Design patterns
A template for solving a common and general data manipulation problem with MapReduce

Summarization: get a top-level view by summarizing and grouping data
Filtering: view data subsets such as records generated from one user
Data Organization: reorganize data to work with other systems, or to make MapReduce analysis easier
Join: analyze different datasets together to discover interesting relationships
Metapattern: piece together several patterns to solve multi-stage problems, or to perform several analytics in the same job
Input and Output: customize the way you use Hadoop to load or store data

Source: MapReduce