HADOOP

An open-source framework for distributed storage and processing of large datasets across clusters of computers.

HDFS

Hadoop Distributed File System; a scalable and fault-tolerant storage system for managing large datasets.
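
As a sketch of how a client might talk to HDFS from Python, the snippet below reads a file through pyarrow's Hadoop filesystem binding; the host, port, and path are placeholders, and a configured Hadoop client (libhdfs) is assumed.

```python
# Minimal sketch: read the start of a file stored in HDFS via pyarrow.
# Host, port, and path are hypothetical; a local Hadoop install is assumed.
from pyarrow import fs

hdfs = fs.HadoopFileSystem(host="namenode.example.com", port=8020)
with hdfs.open_input_stream("/data/logs/events.txt") as f:
    chunk = f.read(1024)  # read the first 1 KB of the file
print(chunk[:80])
```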

MAPREDUCE

A programming model for processing large data sets with a distributed algorithm on a cluster.
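
The classic word-count example below illustrates the model as a pair of Hadoop Streaming scripts in Python: the mapper emits (word, 1) pairs, and the reducer sums the counts for each word after the framework sorts by key. File names and paths here are illustrative.

```python
# mapper.py: emit (word, 1) for every word read from stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
# reducer.py: sum counts per word; the framework sorts by key,
# so all lines for a given word arrive consecutively.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t")
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, 0
    count += int(value)
if current_word is not None:
    print(f"{current_word}\t{count}")
```

A job like this would typically be launched with the Hadoop Streaming jar, passing the two scripts as -mapper and -reducer along with -input and -output paths on HDFS.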

YARN

Yet Another Resource Negotiator; the layer of Hadoop that manages cluster resources and schedules applications.

DATANODE

A worker node in HDFS that stores the actual data blocks and serves read/write requests from clients.

NAMENODE

The master node in HDFS that manages the file system metadata (the namespace and block locations) and regulates client access to files.

CLUSTER

A group of interconnected computers that work together to process large datasets.

SPLIT

A chunk of input data sized for processing by a single map task in MapReduce.

JOB

A unit of work submitted to the Hadoop cluster for processing.

PIG

A high-level platform for creating programs that run on Hadoop, using a language called Pig Latin.
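
For a feel of Pig Latin, here is a word-count script sketch; it is held in a Python string only to keep this glossary's examples in one language, and in practice it would be saved to a .pig file and run with the pig command. Relation names and paths are hypothetical.

```python
# A Pig Latin word-count sketch, held in a Python string for illustration.
# In practice this would live in its own .pig file and be run with `pig`.
WORD_COUNT_PIG = """
lines  = LOAD '/data/input.txt' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
groups = GROUP words BY word;
counts = FOREACH groups GENERATE group AS word, COUNT(words) AS n;
STORE counts INTO '/data/word_counts';
"""
print(WORD_COUNT_PIG)
```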

HIVE

A data warehouse infrastructure built on Hadoop for querying and managing large datasets using a SQL-like language (HiveQL).
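
A minimal sketch of querying Hive from Python, assuming a HiveServer2 endpoint and using the PyHive client; the host, table, and column names are placeholders.

```python
# Minimal sketch: run a HiveQL query over a hypothetical web_logs table.
# Host and port assume a reachable HiveServer2 instance.
from pyhive import hive

conn = hive.connect(host="hiveserver.example.com", port=10000)
cursor = conn.cursor()
cursor.execute("SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page")
for page, hits in cursor.fetchall():
    print(page, hits)
```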

SPARK

An open-source data processing engine that can run on Hadoop, known for its speed and ease of use.
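
A minimal PySpark sketch of the same word-count task, which hints at why Spark is often described as easier to use than raw MapReduce; the input path is a placeholder.

```python
# Minimal PySpark sketch: count words in a text file on HDFS.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("word-count").getOrCreate()
lines = spark.read.text("hdfs:///data/input.txt")
counts = (lines.rdd
          .flatMap(lambda row: row.value.split())  # one record per word
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b))        # sum counts per word
for word, n in counts.take(10):
    print(word, n)
spark.stop()
```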

DATA ANALYTICS

The science of analyzing raw data to uncover trends, patterns, and insights.

EXPLORATORY DATA ANALYSIS

An approach for analyzing datasets to summarize their main characteristics, often using visual methods.
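
A minimal EDA sketch with pandas, assuming a hypothetical sales.csv; the first look at a dataset usually covers sample rows, summary statistics, and missing values.

```python
# Minimal EDA sketch: load a dataset and summarize its main characteristics.
import pandas as pd

df = pd.read_csv("sales.csv")   # hypothetical dataset
print(df.head())                # first rows, to eyeball the data
print(df.describe())            # count, mean, std, min/max per numeric column
print(df.isna().sum())          # missing values per column
```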

DATA VISUALIZATION

The graphical representation of information and data to communicate insights clearly.
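
A minimal matplotlib sketch; the data here is invented purely for illustration.

```python
# Minimal visualization sketch: a labeled bar chart of made-up revenue data.
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
revenue = [10.2, 11.5, 9.8, 13.1]   # illustrative values

plt.bar(months, revenue)
plt.xlabel("Month")
plt.ylabel("Revenue (millions)")
plt.title("Revenue by month")
plt.show()
```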

FAULT TOLERANCE

The ability of a system to continue operating in the event of a failure of some of its components.

REPLICATION

The process of duplicating data across multiple nodes in HDFS for reliability and availability.
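
HDFS tracks a replication factor per file; the sketch below shells out to the standard hdfs dfs -setrep command from Python, with a placeholder path and an assumed configured Hadoop client.

```python
# Minimal sketch: raise a file's replication factor to 3 using the
# standard `hdfs dfs -setrep` command; the path is a placeholder.
import subprocess

subprocess.run(
    ["hdfs", "dfs", "-setrep", "-w", "3", "/data/input.txt"],
    check=True,  # -w waits until the target replication is reached
)
```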

ETL

Extract, Transform, Load; a process for integrating data from multiple sources into a single target store, typically a data warehouse.
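
A minimal ETL sketch in pandas, extracting from a hypothetical CSV, transforming it, and loading it into SQLite as a stand-in target database; file, table, and column names are placeholders.

```python
# Minimal ETL sketch: CSV -> pandas -> SQLite.
import sqlite3
import pandas as pd

df = pd.read_csv("orders.csv")                  # extract
df["total"] = df["quantity"] * df["price"]      # transform: derive a column
df = df.dropna(subset=["customer_id"])          # transform: drop bad rows

with sqlite3.connect("warehouse.db") as conn:   # load
    df.to_sql("orders", conn, if_exists="replace", index=False)
```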

SCALABILITY

The ability of a system to handle a growing amount of work, for example by adding nodes to a cluster.

DATA LAKE

A centralized repository that allows you to store all your structured and unstructured data at any scale.

BIG DATA

Extremely large datasets that may be analyzed computationally to reveal patterns, trends, and associations.

DATA SCIENCE

A multidisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge from data.

MACHINE LEARNING

A subset of AI that uses statistical techniques to enable computers to improve at tasks with experience.
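
A minimal scikit-learn sketch of the "improve with experience" loop: fit a model on training data, then measure it on held-out data.

```python
# Minimal supervised-learning sketch: train a classifier, then score it
# on data it has never seen.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)                      # learn from the data
print("accuracy:", model.score(X_test, y_test))  # evaluate on held-out data
```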

CASE STUDY

An in-depth analysis of a particular instance or example within a real-world context.

API

Application Programming Interface; a set of protocols for building and interacting with software applications.
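
A minimal sketch of consuming an HTTP API with the requests library; the URL and response fields are placeholders.

```python
# Minimal sketch: call a hypothetical REST endpoint and read its JSON body.
import requests

resp = requests.get("https://api.example.com/v1/users/42", timeout=10)
resp.raise_for_status()   # fail loudly on HTTP errors
user = resp.json()        # parse the JSON response body
print(user.get("name"))
```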