The Big-Data Ecosystem Table

Incomplete-but-useful list of big-data related projects packed into a JSON dataset.

by Andrea Mostosi (http://blog.andreamostosi.name)

Frameworks
Apache Hadoop

framework for distributed processing. Integrates MapReduce (parallel processing), YARN (job scheduling) and HDFS (distributed file system)

1. Apache Hadoop
Distributed Programming
AddThis Hydra

Hydra is a distributed data processing and storage system originally developed at AddThis. It ingests streams of data (think log files) and builds trees that are aggregates, summaries, or transformations of the data. These trees can be used by humans to explore (tiny queries), as part of a machine learning pipeline (big queries), or to support live consoles on websites (lots of queries).

1. Github
Akela

Mozilla’s utility library for Hadoop, HBase, Pig, etc.

1. Website
Amazon Lambda

a compute service that runs your code in response to events and automatically manages the compute resources for you

1. Website
Amazon SPICE

Super-fast Parallel In-memory Calculation Engine

1. Website
AMPcrowd

A RESTful web service that runs microtasks across multiple crowds

1. Website
AMPLab G-OLA

a novel mini-batch execution model that generalizes OLA to support general OLAP queries with arbitrarily nested aggregates using efficient delta maintenance techniques

1. Website
AMPLab SIMR

Apache Spark was developed with Apache YARN in mind. However, up to now, it has been relatively hard to run Apache Spark on Hadoop MapReduce v1 clusters, i.e. clusters that do not have YARN installed. Typically, users would have to get permission to install Spark/Scala on some subset of the machines, a process that could be time consuming. SIMR allows anyone with access to a Hadoop MapReduce v1 cluster to run Spark out of the box. A user can run Spark directly on top of Hadoop MapReduce v1 without any administrative rights, and without having Spark or Scala installed on any of the nodes.

1. SIMR on GitHub
Apache Crunch

Apache Crunch is a simple Java API for tasks like joining and data aggregation that are tedious to implement on plain MapReduce. The APIs are especially useful when processing data that does not fit naturally into the relational model, such as time series, serialized object formats like protocol buffers or Avro records, and HBase rows and columns. For Scala users, there is the Scrunch API, which is built on top of the Java APIs and includes a REPL (read-eval-print loop) for creating MapReduce pipelines.

1. Website
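
By way of a rough, hedged sketch (not taken from the Crunch documentation; input and output paths come from the command line, and PType details may differ between releases), a word count over the Java API could look something like this:

```java
import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.types.writable.Writables;

public class CrunchWordCount {
  public static void main(String[] args) {
    // One pipeline object plans and runs the underlying MapReduce jobs
    Pipeline pipeline = new MRPipeline(CrunchWordCount.class);
    PCollection<String> lines = pipeline.readTextFile(args[0]);

    // parallelDo is Crunch's map-like primitive; count() adds the aggregation
    PTable<String, Long> counts = lines.parallelDo(new DoFn<String, String>() {
      @Override
      public void process(String line, Emitter<String> emitter) {
        for (String word : line.split("\\s+")) {
          emitter.emit(word);
        }
      }
    }, Writables.strings()).count();

    pipeline.writeTextFile(counts, args[1]);
    pipeline.done();  // triggers the actual execution
  }
}
```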
Apache DataFu

DataFu provides a collection of Hadoop MapReduce jobs, and functions in higher-level languages built on top of them, to perform data analysis. It provides functions for common statistics tasks (e.g. quantiles, sampling), PageRank, stream sessionization, and set and bag operations. DataFu also provides Hadoop jobs for incremental data processing in MapReduce. At its core, DataFu is a collection of Pig UDFs (including PageRank, sessionization, set operations, sampling, and much more) originally developed at LinkedIn.

1. DataFu Apache Incubator
2. LinkedIn DataFu
Apache Flink

high-performance runtime, and automatic program optimization

1. Website
Apache Gora

framework for in-memory data model and persistence

1. Apache Gora
Apache Hama

Apache top-level open source project, allowing you to do advanced analytics beyond MapReduce. Many data analysis techniques such as machine learning and graph algorithms require iterative computations; this is where the Bulk Synchronous Parallel model can be more effective than “plain” MapReduce.

1. Hama site
Apache Ignite

high-performance, integrated and distributed in-memory platform for computing and transacting on large-scale data sets in real-time

1. Website
Apache MapReduce

MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster. Apache MapReduce was derived from Google MapReduce: Simplified Data Processing on Large Clusters paper. The current Apache MapReduce version is built over Apache YARN Framework. YARN stands for “Yet-Another-Resource-Negotiator”. It is a new framework that facilitates writing arbitrary distributed processing frameworks and applications. YARN’s execution model is more generic than the earlier MapReduce implementation. YARN can run applications that do not follow the MapReduce model, unlike the original Apache Hadoop MapReduce (also called MR1). Hadoop YARN is an attempt to take Apache Hadoop beyond MapReduce for data-processing.

1. Apache MapReduce
2. Google MapReduce paper
3. Writing YARN applications
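
As a hedged illustration of the programming model, here is a standard word count against the org.apache.hadoop.mapreduce API; the input and output paths are placeholders passed on the command line:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every token in the input line
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        word.set(token);
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory on HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory (must not exist)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```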
Apache Pig

Pig provides an engine for executing data flows in parallel on Hadoop. It includes a language, Pig Latin, for expressing these data flows. Pig Latin includes operators for many of the traditional data operations (join, sort, filter, etc.), as well as the ability for users to develop their own functions for reading, processing, and writing data. Pig runs on Hadoop. It makes use of both the Hadoop Distributed File System, HDFS, and Hadoop’s processing system, MapReduce.

1. pig.apache.org/
2. Pig examples by Alan Gates
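
A hedged sketch of embedding Pig Latin from Java through the PigServer API (input and output paths are placeholders; the same word count could equally be run as a standalone Pig Latin script):

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigWordCount {
  public static void main(String[] args) throws Exception {
    // PigServer lets a Java program register and run Pig Latin statements
    PigServer pig = new PigServer(ExecType.MAPREDUCE);
    pig.registerQuery("lines = LOAD 'input/lines.txt' AS (line:chararray);");
    pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
    pig.registerQuery("grouped = GROUP words BY word;");
    pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");
    // STORE the relation 'counts' into the output directory on HDFS
    pig.store("counts", "output/wordcount");
    pig.shutdown();
  }
}
```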
Apache S4

S4 is a general-purpose, distributed, scalable, fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data.

1. Apache S4
Apache Spark

Data analytics cluster computing framework originally developed in the AMPLab at UC Berkeley. Spark fits into the Hadoop open-source community, building on top of the Hadoop Distributed File System (HDFS). However, Spark provides an easier to use alternative to Hadoop MapReduce and offers performance up to 10 times faster than previous generation systems like Hadoop MapReduce for certain applications.

1. Apache Incubator Spark
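
A hedged sketch of the same kind of word count using Spark's Java API (Spark 2.x signatures assumed; the input and output paths are placeholders):

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class SparkWordCount {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("spark-word-count");
    try (JavaSparkContext sc = new JavaSparkContext(conf)) {
      JavaRDD<String> lines = sc.textFile(args[0]);            // e.g. an HDFS path
      JavaPairRDD<String, Integer> counts = lines
          .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
          .mapToPair(word -> new Tuple2<>(word, 1))
          .reduceByKey(Integer::sum);                          // aggregation kept in memory where possible
      counts.saveAsTextFile(args[1]);
    }
  }
}
```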
Apache Spark Streaming

framework for stream processing, part of Spark

1. Apache Spark Streaming
Apache Storm

Storm is a complex event processor and distributed computation framework written predominantly in the Clojure programming language. It is a distributed real-time computation system for processing fast, large streams of data. Storm follows a master/worker architecture: a cluster consists of a master node and worker nodes, with coordination handled by ZooKeeper.

1. Storm Project/
2. Storm-on-YARN
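
A hedged, minimal topology sketch, assuming the newer org.apache.storm package namespace (older releases used backtype.storm); the spout and bolt here are illustrative stand-ins, and a real deployment would use StormSubmitter instead of LocalCluster:

```java
import java.util.Map;
import java.util.Random;

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

public class WordTopology {

  // Illustrative spout: emits random words as an unbounded stream
  public static class WordSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private final String[] words = {"storm", "stream", "tuple"};
    private final Random random = new Random();

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
      this.collector = collector;
    }

    @Override
    public void nextTuple() {
      Utils.sleep(100);
      collector.emit(new Values(words[random.nextInt(words.length)]));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("word"));
    }
  }

  // Illustrative bolt: prints each word it receives (BaseBasicBolt acks automatically)
  public static class PrinterBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
      System.out.println(tuple.getStringByField("word"));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) { }
  }

  public static void main(String[] args) throws Exception {
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("words", new WordSpout());
    builder.setBolt("printer", new PrinterBolt()).shuffleGrouping("words");

    // Local mode for experimentation only
    LocalCluster cluster = new LocalCluster();
    cluster.submitTopology("word-topology", new Config(), builder.createTopology());
  }
}
```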
Apache Tez

Tez is a proposal to develop a generic application framework which can be used to process complex DAGs of data-processing tasks and which runs natively on Apache Hadoop YARN.

1. Apache Tez
Apache Twill

Twill is an abstraction over Apache Hadoop® YARN that reduces the complexity of developing distributed applications, allowing developers to focus more on their business logic. Twill uses a simple thread-based model that Java programmers will find familiar. YARN can be viewed as a compute fabric of a cluster, which means YARN applications like Twill will run on any Hadoop 2 cluster.

1. Apache Twill Incubator
Arvados

Open-source platform, built as a set of microservices, for managing and analyzing large biomedical and genomic data

1. Website
Blaze

gives Python users high-level access to efficient computation on inconveniently large data

1. Website
Cascalog

data processing and querying library

1. Cascalog
Cheetah

High Performance, Custom Data Warehouse on Top of MapReduce

1. Paper
Concurrent Cascading

Application framework for Java developers to simply develop robust Data Analytics and Data Management applications on Apache Hadoop.

1. Cascading
Damballa Parkour

Library for developing MapReduce programs using Clojure, a Lisp-like language. Parkour aims to provide deep Clojure integration for Hadoop. Programs using Parkour are normal Clojure programs, using standard Clojure functions instead of new framework abstractions. Programs using Parkour are also full Hadoop programs, with complete access to absolutely everything possible in raw Java Hadoop MapReduce.

1. Parkour GitHub Project
Datasalt Pangool

A new MapReduce paradigm: a higher-level API for MR jobs than plain Java.

1. Website
DataTorrent StrAM

a real-time engine designed to enable distributed, asynchronous, in-memory big-data computations in as unblocked a way as possible, with minimal overhead and impact on performance

1. Website
DistributedR

scalable high-performance platform for the R language

1. Website
Drools

a Business Rules Management System (BRMS) solution

1. Website
eBay Oink

REST based interface for PIG execution

1. Website
2. Website
Esper

a highly scalable, memory-efficient, in-memory computing, SQL-standard, minimal-latency, real-time streaming-capable Big Data processing engine for historical and streaming data

1. Website
Facebook Corona

“The next version of Map-Reduce” from Facebook, based on Facebook’s own fork of Hadoop. The current Hadoop implementation of the MapReduce technique uses a single job tracker, which causes scaling issues for very large data sets. The Apache Hadoop developers have been creating their own next-generation MapReduce, called YARN, which Facebook engineers looked at but discounted because of the highly-customised nature of the company’s deployment of Hadoop and HDFS. Corona, like YARN, spawns multiple job trackers (one for each job, in Corona’s case).

1. Website
Facebook Peregrine

Map Reduce framework

1. Facebook Peregrine
Facebook Scuba

distributed in-memory datastore

1. Website
GearPump

a lightweight real-time big data streaming engine

1. Website
Geotrellis

geographic data processing engine for high performance applications

1. Website
2. Website
GetStream Stream Framework

a Python library, which allows you to build newsfeed and notification systems using Cassandra and/or Redis

1. Website
GIS Tools for Hadoop

Big Data Spatial Analytics for the Hadoop Framework

1. Website
Google Dataflow

create data pipelines to help them ingest, transform and analyze data

1. Website
Google FlumeJava

Easy, Efficient Data-Parallel Pipelines. Base of Google Dataflow

1. Website
Google MapReduce

map reduce framework

1. Website
Google MillWheel

fault tolerant stream processing framework

1. Website
GraphLab Dato

fast, scalable engine of GraphLab Create, a Python library

1. Website
Hazelcast

In-Memory Data Grid

1. Website
HParser

data parsing transformation environment optimized for Hadoop

1. Website
IBM Streams

advanced analytic platform that allows user-developed applications to quickly ingest, analyze and correlate information as it arrives from thousands of real-time sources

1. Website
JAQL

JAQL is a functional, declarative programming language designed especially for working with large volumes of structured, semi-structured and unstructured data. As its name implies, a primary use of JAQL is to handle data stored as JSON documents, but JAQL can work on various types of data. For example, it can support XML, comma-separated values (CSV) data and flat files. A “SQL within JAQL” capability lets programmers work with structured SQL data while employing a JSON data model that’s less restrictive than its Structured Query Language counterparts.

1. JAQL in Google Code
2. What is Jaql? by IBM
Kite

is a set of libraries, tools, examples, and documentation focused on making it easier to build systems on top of the Hadoop ecosystem.

1. Website
Kryo

Java serialization and cloning: fast, efficient, automatic

1. Website
LinkedIn Cubert

a fast and efficient batch computation engine for complex analysis and reporting of massive datasets on Hadoop

1. Website
Lipstick

Pig workflow visualization tool

1. Website
Metamarkets Druid

Realtime analytical data store.

1. Druid
Microsoft Azure Stream Analytics

an event processing engine that helps uncover real-time insights from devices, sensors, infrastructure, applications and data

1. Website
Microsoft Orleans

a straightforward approach to building distributed high-scale computing applications

1. Website
Microsoft Project Orleans

a framework that provides a straightforward approach to building distributed high-scale computing applications

1. Website
Microsoft Trill

a high-performance in-memory incremental analytics engine

1. Website
Netflix Aegisthus

Bulk data pipeline out of Cassandra. It implements a reader for the SSTable format and provides a map/reduce program to create a compacted snapshot of the data contained in a column family.

1. Website
Netflix Lipstick

Pig Visualization framework

1. Website
Netflix Mantis

Event Stream Processing System

1. Website
Netflix PigPen

PigPen is map-reduce for Clojure which compiles to Apache Pig. Clojure is a dialect of the Lisp programming language created by Rich Hickey, so it is a functional general-purpose language that runs on the Java Virtual Machine, Common Language Runtime, and JavaScript engines. In PigPen there are no special user-defined functions (UDFs). Define Clojure functions, anonymously or named, and use them like you would in any Clojure program. This tool is open sourced by Netflix, Inc., the American provider of on-demand Internet streaming media.

1. PigPen on GitHub
Netflix STAASH

language-agnostic as well as storage-agnostic web interface for storing data into persistent storage systems

1. Website
Netflix Surus

a collection of tools for analysis in Pig and Hive

1. Website
Netflix Zeno

Netflix’s In-Memory Data Propagation Framework

1. Website
Nextflow

Dataflow oriented toolkit for parallel and distributed computational pipelines

1. Website
Nokia Disco

MapReduce framework developed by Nokia

1. Nokia Disco
Oryx

is a realization of the lambda architecture built on Apache Spark and Apache Kafka, but with specialization for real-time large scale machine learning

1. Website
Pachyderm

lets you store and analyze your data using containers.

1. Website
Parsely Streamparse

streamparse lets you run Python code against real-time streams of data. It also integrates Python smoothly with Apache Storm.

1. Website
PigPen

PigPen is map-reduce for Clojure, or distributed Clojure. It compiles to Apache Pig, but you don’t need to know much about Pig to use it

1. Website
Pinterest Pinlater

asynchronous job execution system

1. Website
Pubnub

Data stream network

1. Website
Pydoop

Pydoop is a Python MapReduce and HDFS API for Hadoop, built upon the C++ Pipes and the C libhdfs APIs, that allows you to write full-fledged MapReduce applications with HDFS access. Pydoop has several advantages over Hadoop’s built-in solutions for Python programming, i.e., Hadoop Streaming and Jython: being a CPython package, it allows you to access all standard library and third-party modules, some of which may not be available.

1. SF Pydoop site
2. Pydoop GitHub Project
ScaleOut hServer

fast, scalable in-memory data grid for Hadoop

1. Website
SeqPig

Simple and scalable scripting for large sequencing data sets (e.g. in bioinformatics) in Hadoop

1. Website
SigmoidAnalytics Spork

Pig on Apache Spark

1. Website
SNAP

Stanford Network Analysis Platform is a general purpose, high performance system for analysis and manipulation of large networks

1. Website
spark-dataflow

allows users to execute dataflow pipelines with Spark

1. Website
SpatialHadoop

SpatialHadoop is a MapReduce extension to Apache Hadoop designed specially to work with spatial data.

1. Website
Spring for Apache Hadoop

unified configuration model and easy to use APIs for using HDFS, MapReduce, Pig, and Hive

1. Website
SQLStream Blaze

stream processing platform

1. Website
Stratio Crossdata

provides a unified way to access multiple datastores

1. Website
Stratio Decision

the union of a real-time messaging bus with a complex event processing engine using Spark Streaming

1. Website
Stratio Streaming

the union of a real-time messaging bus with a complex event processing engine using Spark Streaming

1. Website
Stratosphere

Stratosphere is a general purpose cluster computing framework. It is compatible with the Hadoop ecosystem: Stratosphere can access data stored in HDFS and runs with Hadoop’s new cluster manager YARN. The common input formats of Hadoop are supported as well. Stratosphere does not use Hadoop’s MapReduce implementation: it is a completely new system that brings its own runtime. The new runtime allows defining more advanced operations that include more transformations than just map and reduce. Additionally, Stratosphere allows expressing analysis jobs using advanced data flow graphs, which resemble common data analysis tasks more naturally.

1. Stratosphere site
Streamdrill

useful for counting activities of event streams over different time windows and finding the most active one

1. Website
Succinct Spark

Enabling Queries on Compressed Data

1. Website
Sumo Logic

cloud-based analyzer for machine-generated data.

1. Website
Teradata QueryGrid

data-access layer that can orchestrate multiple modes of analysis across multiple databases plus Hadoop

1. Website
TIBCO ActiveSpaces

in-memory data grid

1. Website
Tigon

a distributed framework built on Apache Hadoop and Apache HBase for real-time, high-throughput, low-latency data processing and analytics applications

1. Website
Torch

Scientific computing for LuaJIT

1. Website
Trident

a high-level abstraction for doing realtime computing on top of Storm

1. Website
Twitter Crane

Java ETL

1. Website
Twitter Gizzard

a flexible sharding framework for creating eventually-consistent distributed datastores

1. Website
Twitter Heron

a realtime, distributed, fault-tolerant stream processing engine from Twitter

1. Website
Twitter Scalding

Scala library for Map Reduce jobs, built on Cascading

1. Twitter Scalding
Twitter Summingbird

a system that aims to mitigate the tradeoffs between batch processing and stream processing by combining them into a hybrid system. In the case of Twitter, Hadoop handles batch processing, Storm handles stream processing, and the hybrid system is called Summingbird.

1. Summingbird
Twitter TSAR

TimeSeries AggregatoR by Twitter

1. Website
2. Website
Distributed Filesystem
Amazon Elastic File System

file storage service for Amazon Elastic Compute Cloud (Amazon EC2) instances

1. Website
Amazon Simple Storage Service

secure, durable, highly-scalable object storage

1. Website
Apache HDFS

The Hadoop Distributed File System (HDFS) offers a way to store large files across multiple machines. Hadoop and HDFS were derived from the Google File System (GFS) paper. Prior to Hadoop 2.0.0, the NameNode was a single point of failure (SPOF) in an HDFS cluster. With ZooKeeper, the HDFS High Availability feature addresses this problem by providing the option of running two redundant NameNodes in the same cluster in an Active/Passive configuration with a hot standby.

1. hadoop.apache.org
2. Google FileSystem - GFS Paper
3. Cloudera Why HDFS
4. Hortonworks Why HDFS
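
A hedged sketch of writing and reading a file through the Java FileSystem API (the NameNode URI and path are placeholders; in practice the address usually comes from core-site.xml):

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Placeholder NameNode address
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

    Path path = new Path("/tmp/hello.txt");
    try (FSDataOutputStream out = fs.create(path, true)) {   // true = overwrite
      out.writeUTF("hello hdfs");
    }
    try (FSDataInputStream in = fs.open(path)) {
      System.out.println(in.readUTF());
    }
    fs.close();
  }
}
```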
Apache Kudu

completes Hadoop’s storage layer to enable fast analytics on fast data

1. Website
BeeGFS

formerly FhGFS, parallel distributed file system

1. Website
Ceph Filesystem

Ceph is a free software storage platform designed to present object, block, and file storage from a single distributed computer cluster. Ceph’s main goals are to be completely distributed without a single point of failure, scalable to the exabyte level, and freely available. The data is replicated, making it fault tolerant. The current limitation is that Ceph requires the Hadoop 1.1.x stable series.

1. Ceph Filesystem site
2. Ceph and Hadoop
3. HADOOP-6253
Disco DDFS

distributed filesystem

1. Website
Facebook Haystack

object storage system

1. Facebook Haystack
Google Cloud Storage

durable and highly available object storage

1. Website
Google Cloud Storage Nearline

a highly available, affordable solution for backup, archiving and disaster recovery.

1. Website
Google Colossus

distributed filesystem (GFS2)

1. Website
Google GFS

distributed filesystem

1. Website
Google Megastore

scalable, highly available storage

1. Website
GridGain

GridGain is an open source project licensed under Apache 2.0. One of the main pieces of this platform is the In-Memory Apache Hadoop Accelerator, which aims to accelerate HDFS and Map/Reduce by bringing both data and computations into memory. This work is done with GGFS, a Hadoop-compliant in-memory file system. For I/O-intensive jobs GridGain GGFS offers performance close to 100x faster than standard HDFS. Paraphrasing Dmitriy Setrakyan from GridGain Systems talking about GGFS regarding Tachyon: GGFS allows read-through and write-through to/from underlying HDFS or any other Hadoop-compliant file system with zero code change. Essentially GGFS entirely removes the ETL step from integration. GGFS has the ability to pick and choose what folders stay in memory, what folders stay on disk, and what folders get synchronized with the underlying (HD)FS either synchronously or asynchronously. GridGain is working on adding a native MapReduce component which will provide native complete Hadoop integration without changes in API, like Spark currently forces you to do. Essentially GridGain MR+GGFS will allow bringing Hadoop completely or partially in-memory in a plug-and-play fashion without any API changes.

1. GridGain site
HDFS-DU

HDFS-DU is an interactive visualization of the Hadoop distributed file system.

1. Website
Lustre file system

The Lustre filesystem is a high-performance distributed filesystem intended for larger network and high-availability environments. Traditionally, Lustre is configured to manage remote data storage disk devices within a Storage Area Network (SAN), which is two or more remotely attached disk devices communicating via a Small Computer System Interface (SCSI) protocol. This includes Fibre Channel, Fibre Channel over Ethernet (FCoE), Serial Attached SCSI (SAS) and even iSCSI.

1. wiki.lustre.org/
2. Hadoop with Lustre
3. Intel HPC Hadoop
MapR-FS

Distributed filesystem from MapR

1. Website
Microsoft Azure Data Lake

a hyper scale repository for big data analytic workloads

1. Website
Netflix S3mper

library that provides an additional layer of consistency checking on top of Amazon’s S3 index through use of a consistent, secondary index

1. Website
Quantcast File System QFS

(QFS) is an open-source distributed file system software package for large-scale MapReduce or other batch-processing workloads. It was designed as an alternative to Apache Hadoop’s HDFS, intended to deliver better performance and cost-efficiency for large-scale processing clusters. It is written in C++ and has fixed-footprint memory management. QFS uses Reed-Solomon error correction as its method for ensuring reliable access to data.

1. QFS site
2. GitHub QFS
3. HADOOP-8885
Red Hat GlusterFS

GlusterFS is a scale-out network-attached storage file system. GlusterFS was developed originally by Gluster, Inc., then by Red Hat, Inc., after their purchase of Gluster in 2011. In June 2012, Red Hat Storage Server was announced as a commercially supported integration of GlusterFS with Red Hat Enterprise Linux; the Gluster File System is now known as Red Hat Storage Server.

1. www.gluster.org
2. Red Hat Hadoop Plugin
Tachyon

Tachyon is a memory-centric distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.

1. Tachyon site
Key-Map Data Model
Actian Vector

column-oriented analytic database

1. Actian website
Apache Accumulo

A distributed key/value store that is a robust, scalable, high-performance data storage and retrieval system. Apache Accumulo is based on Google’s BigTable design and is built on top of Apache Hadoop, ZooKeeper, and Thrift. Accumulo was created by the NSA and adds cell-level security features.

1. Apache Accumulo
Apache Cassandra

Distributed NoSQL DBMS; it’s a BDDB. MR can retrieve data from Cassandra. This BDDB can run without HDFS, or on top of HDFS (the DataStax fork of Cassandra). HBase and its required supporting systems are derived from what is known of the original Google BigTable and Google File System designs (as known from the Google File System paper published in 2003, and the BigTable paper published in 2006). Cassandra, on the other hand, is an open source database system initially coded by Facebook which, while implementing the BigTable data model, uses a system inspired by Amazon’s Dynamo for storing data (in fact much of the initial development work on Cassandra was performed by two Dynamo engineers recruited to Facebook from Amazon).

1. Apache Cassandra
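
A hedged sketch using the DataStax Java driver (3.x-style API assumed; the contact point, keyspace and table names are placeholders):

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class CassandraExample {
  public static void main(String[] args) {
    try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
         Session session = cluster.connect()) {

      session.execute("CREATE KEYSPACE IF NOT EXISTS demo "
          + "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}");
      session.execute("CREATE TABLE IF NOT EXISTS demo.users (id int PRIMARY KEY, name text)");

      // Parameterized insert and a simple read
      session.execute("INSERT INTO demo.users (id, name) VALUES (?, ?)", 1, "alice");
      ResultSet rs = session.execute("SELECT id, name FROM demo.users WHERE id = ?", 1);
      for (Row row : rs) {
        System.out.println(row.getInt("id") + " -> " + row.getString("name"));
      }
    }
  }
}
```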
Apache HBase

Google BigTable inspired. Non-relational distributed database. Random, real-time r/w operations on column-oriented, very large tables (BDDB: Big Data DataBase). It is the backing system for MR job outputs. It is the Hadoop database, used for backing Hadoop MapReduce jobs with Apache HBase tables.

1. Apache HBase
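
A hedged sketch against the HBase Java client (the "users" table with column family "info" is assumed to already exist; configuration is read from hbase-site.xml on the classpath):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("users"))) {

      // Write one cell: row key "user1", column family "info", qualifier "name"
      Put put = new Put(Bytes.toBytes("user1"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
      table.put(put);

      // Random, real-time read of the same row
      Result result = table.get(new Get(Bytes.toBytes("user1")));
      byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
      System.out.println(Bytes.toString(name));
    }
  }
}
```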
Facebook HydraBase

Evolution of HBase made by Facebook

1. Blog Post on Facebook engineer
Google BigTable

column-oriented distributed datastore

1. Google BigTable
Google Cloud Datastore

is a fully managed, schemaless database for storing non-relational data built on top of Google’s BigTable infrastructure

1. Google Cloud Datastore site
2. Google App Engine Datastore
3. Mastering Datastore
Hypertable

Database system inspired by publications on the design of Google’s BigTable. The project is based on the experience of engineers who were solving large-scale data-intensive tasks for many years. Hypertable runs on top of a distributed file system such as the Apache Hadoop DFS, GlusterFS, or the Kosmos File System (KFS). It is written almost entirely in C++. Sponsored by Baidu, the Chinese search engine.

1. HyperTable
InfiniDB

is accessed through a MySQL interface and uses massively parallel processing to parallelize queries

1. Website
MapR-DB

fast, scalable, and enterprise-ready in-Hadoop database architected to manage big data

1. Website
Netflix Priam

Co-Process for backup/recovery, Token Management, and Centralized Configuration management for Cassandra

1. Website
OhmData C5

improved version of HBase

1. OhmData website
Palantir AtlasDB

a massively scalable datastore and transactional layer that can be placed on top of any key-value store to give it ACID properties

1. Website
Sqrrl

NoSQL databases on top of Apache Accumulo

1. Website
Stratio Cassandra

Cassandra’s index functionality has been extended to provide near-real-time search as in Elasticsearch or Solr, including full-text search capabilities and multivariable, geospatial and bitemporal search

1. Website
Tephra

Transactions for HBase

1. Website
Twitter Manhattan

real-time, multi-tenant distributed database for Twitter scale

1. Blog post on Twitter Engineering blog
Document Data Model
Actian Versant

commercial object-oriented database management systems

1. Website
Amazon SimpleDB

a highly available and flexible non-relational data store that offloads the work of database administration

1. Website
BigchainDB

The scalable blockchain database.

1. Website
Clusterpoint

a database software for high-speed storage and large-scale processing of XML and JSON data on clusters of commodity hardware

1. Website
Crate Data

is an open source massively scalable data store. It requires zero administration.

1. Website
Facebook Apollo

Facebook’s Paxos-like NoSQL database

1. infoQ post
2. Website
jumboDB

document oriented datastore over Hadoop

1. jumboDB
LinkedIn Ambry

Distributed object store

1. Website
LinkedIn Espresso

horizontally scalable document-oriented NoSQL data store

1. LinkedIn Espresso
MarkLogic

Schema-agnostic Enterprise NoSQL database technology

1. Website
Microsoft DocumentDB

fully-managed, highly-scalable, NoSQL document database service

1. Website
Microsoft StorSimple

a unique hybrid cloud storage solution that lowers costs and improves data protection

1. Website
MongoDB

Document-oriented database system. It is part of the NoSQL family of database systems. Instead of storing data in tables as is done in a “classical” relational database, MongoDB stores structured data as JSON-like documents

1. Mongodb site
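
A hedged sketch using the MongoDB Java driver (the MongoClients API from driver 3.7+ is assumed; connection string, database and collection names are placeholders):

```java
import org.bson.Document;

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;

public class MongoExample {
  public static void main(String[] args) {
    try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
      MongoCollection<Document> users = client.getDatabase("demo").getCollection("users");

      // Data is stored as JSON-like (BSON) documents rather than table rows
      users.insertOne(new Document("name", "alice").append("age", 30));

      Document found = users.find(Filters.eq("name", "alice")).first();
      System.out.println(found.toJson());
    }
  }
}
```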
RavenDB

A transactional, open-source Document Database

1. Website
RethinkDB

RethinkDB is built to store JSON documents, and scale to multiple machines with very little effort. It has a pleasant query language that supports really useful queries like table joins and group by, and is easy to set up and learn.

1. RethinkDB site
Terrastore

a modern document store which provides advanced scalability and elasticity features without sacrificing consistency

1. Website
TokuMX

High-Performance MongoDB Distribution

1. Website
Tokutek

Tokutek claims to improve MongoDB performance 20x

1. Website
Key-value Data Model
Aerospike

NoSQL, flash-optimized, in-memory database. Open source, with server code in C (not Java or Erlang) precisely tuned to avoid context switching and memory copies.

1. Website
Amazon DynamoDB

distributed key/value store, implementation of Dynamo

1. Amazon DynamoDB
Couchbase ForestDB

Fast Key-Value Storage Engine Based on Hierarchical B+-Tree Trie

1. Website
Edis

Edis is a protocol-compatible Server replacement for Redis, written in Erlang. Edis’s goal is to be a drop-in replacement for Redis when persistence is more important than holding the dataset in-memory. Edis (currently) uses Google’s leveldb as a backend. Future plans call for a multi-master clustering model. Near term goals are to act as a read-slave for existing Redis servers.

1. Website
ElephantDB

Distributed database specialized in exporting data from Hadoop

1. ElephantDB
EventStore

An open-source, functional database with support for Complex Event Processing. It provides a persistence engine for applications using event sourcing, or for storing time-series data. Event Store is written in C# and C++; the server runs on Mono or the .NET CLR, on Linux or Windows. Applications using Event Store can be written in JavaScript.

1. EventStore
2. Website
Exasolution

an in-memory, column-oriented, relational database management system

1. Website
HyperDex

next generation key-value store

1. Website
KAI

a distributed key-value datastore

1. Website
LinkedIn Krati

Krati is a simple persistent data store with very low latency and high throughput. It is designed for easy integration with read-write-intensive applications, with little effort in tuning configuration, performance and JVM garbage collection.

1. Website
Linkedin Voldemort

Distributed data store that is designed as a key-value store used by LinkedIn for high-scalability storage.

1. LinkedIn Voldemort
MemcacheDB

a distributed key-value storage system designed for persistence

1. Website
Netflix Dynomite

thin Dynamo-based replication for cached data

1. Website
Oracle NoSQL Database

distributed key-value database by Oracle Corporation

1. Website
QDB

A fast, high availability, fully Redis compatible store

1. Website
RAMCloud

storage system that provides large-scale low-latency storage by keeping all data in DRAM all the time and aggregating the main memories of thousands of servers

1. Website
RebornDB

Distributed database fully compatible with redis protocol

1. Website
Redis

Redis is an open-source, networked, in-memory, key-value data store with optional durability. It is written in ANSI C. In its outer layer, the Redis data model is a dictionary which maps keys to values. One of the main differences between Redis and other structured storage systems is that Redis supports not only strings, but also abstract data types. Sponsored by Pivotal and VMWare. It’s BSD licensed.

1. Redis.io
2. Website
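
A hedged sketch using the Jedis Java client (host, port and key names are placeholders) to show a few of the abstract data types mentioned above:

```java
import redis.clients.jedis.Jedis;

public class RedisExample {
  public static void main(String[] args) {
    try (Jedis jedis = new Jedis("localhost", 6379)) {
      // Plain string key/value
      jedis.set("greeting", "hello");
      System.out.println(jedis.get("greeting"));

      // List used as a simple queue
      jedis.lpush("queue", "job-1", "job-2");
      System.out.println(jedis.rpop("queue"));

      // Hash storing several fields under one key
      jedis.hset("user:1", "name", "alice");
      System.out.println(jedis.hget("user:1", "name"));

      // Atomic counter
      System.out.println(jedis.incr("page:views"));
    }
  }
}
```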
Redis Cluster

distributed implementation of Redis

1. Website
Redis Sentinel

system designed to help manage Redis instances

1. Website
Riak

a decentralized datastore.

1. Website
Scalaris

a distributed transactional key-value store

1. Website
Storehaus

library to work with asynchronous key value stores, by Twitter

1. Storehaus
Tarantool

an efficient NoSQL database and a Lua application server.

1. Website
TreodeDB

key-value store that’s replicated and sharded and provides atomic multirow writes

1. Website
Yahoo Sherpa

hosted, distributed and geographically replicated key-value cloud storage platform

1. Website
Graph Data Model
Apache Giraph

Apache Giraph is an iterative graph processing system built for high scalability. For example, it is currently used at Facebook to analyze the social graph formed by users and their connections. Giraph originated as the open-source counterpart to Pregel, the graph processing architecture developed at Google

1. Apache Giraph
Apache Spark Bagel

implementation of Pregel, part of Spark

1. Apache Spark Bagel
ArangoDB

An open-source database with a flexible data model for documents, graphs, and key-values. Build high-performance applications using a convenient SQL-like query language or JavaScript extensions.

1. ArangoDB site
Doradus

Doradus is a REST service that extends a Cassandra NoSQL database with a graph-based data model, advanced indexing and search features, and a REST API

1. Website
Facebook TAO

TAO is the distributed data store that is widely used at Facebook to store and serve the social graph. The entire architecture is highly read-optimized, supports a graph data model and works across multiple geographical regions.

1. Post about TAO
Faunus

Hadoop-based graph analytics engine for analyzing graphs represented across a multi-machine compute cluster

1. Website
Google Cayley

open-source graph database.

1. Website
Google Pregel

graph processing framework

1. Website
GraphLab PowerGraph

a core C++ GraphLab API and a collection of high-performance machine learning and data mining toolkits built on top of the GraphLab API. In addition, we are actively developing new interfaces to allow users to leverage the GraphLab API from other languages and technologies.

1. Graphlab website
GraphX

A Resilient Distributed Graph System on Spark

1. GraphX
Gremlin

graph traversal language.

1. Website
HyperGraphDB

general purpose, open-source data storage mechanism based on a powerful knowledge management formalism known as directed hypergraphs

1. Website
InfiniteGraph

distributed graph database

1. Website
Infovore

RDF-centric Map/Reduce framework

1. Website
Intel GraphBuilder

library which provides tools to construct large-scale graphs on top of Apache Hadoop

1. Website
MapGraph

Massively Parallel Graph processing on GPUs

1. Website
Mazerunner for Neo4j

extends a Neo4j graph database to run scheduled big data graph compute algorithms at scale with HDFS and Apache Spark.

1. Website
MemGraph

Cypher-compatible, high-performance, in-memory, transactional and real-time analytics graph database

1. Website
Microsoft Graph Engine

a distributed, in-memory, large graph processing engine, underpinned by a strongly-typed RAM store and a general computation engine

1. Website
Neo4j

An open-source graph database written entirely in Java. It is an embedded, disk-based, fully transactional Java persistence engine that stores data structured in graphs rather than in tables.

1. Neo4j site
OrientDB

It is an Open Source NoSQL DBMS with the features of both Document and Graph DBMSs. Written in Java, it is incredibly fast: it can store up to 150,000 records per second on common hardware.

1. OrientDB site
Phoebus

framework for large scale graph processing

1. Phoebus
Pinterest Zen

Pinterest’s Graph Storage Service

1. Website
Sparksee

scalable high-performance graph database

1. Website
Stardog

graph database: search, query, reasoning, and constraints in a lightweight, pure Java system

1. Website
Titan

distributed graph database, built over Cassandra

1. Titan
Twitter FlockDB

distributed graph database

1. Twitter FlockDB
NewSQL Databases
Actian Ingres

commercially supported, open-source SQL relational database management system

1. Website
BayesDB

BayesDB, a Bayesian database table, lets users query the probable implications of their tabular data as easily as an SQL database lets them query the data itself. Using the built-in Bayesian Query Language (BQL), users with no statistics training can solve basic data science problems, such as detecting predictive relationships between variables, inferring missing values, simulating probable observations, and identifying statistically similar database entries.

1. BayesDB site
Cockroach

Scalable, Geo-Replicated, Transactional Datastore

1. Website
Datomic

distributed database designed to enable scalable, flexible and intelligent applications.

1. Website
FoundationDB

distributed database, inspired by F1; acquired the Akiban server

1. FoundationDB
2. Akiban Server
Google F1

distributed SQL database built on Spanner

1. Website
Google Spanner

globally distributed semi-relational database

1. Website
H-Store

H-Store is an experimental main-memory, parallel database management system that is optimized for on-line transaction processing (OLTP) applications. It is a highly distributed, row-store-based relational database that runs on a cluster of shared-nothing, main-memory executor nodes.

1. Brown project website
HandlerSocket

HandlerSocket is a NoSQL plugin for MySQL/MariaDB (the storage engine of MySQL). It works as a daemon inside the mysqld process, accepting TCP connections, and executing requests from clients. HandlerSocket does not support SQL queries. Instead, it supports simple CRUD operations on tables. HandlerSocket can be much faster than mysqld/libmysql in some cases because it has lower CPU, disk, and network overhead.

1. Website
IBM DB2

object-relational database management system

1. Website
InfiniSQL

infinitely scalable RDBMS

1. InfiniSQL
MemSQL

in-memory SQL database with optimized columnar storage on flash

1. MemSQL site
NuoDB

SQL/ACID compliant distributed database

1. NuoDB
Oracle Database

object-relational database management system

1. Website
Oracle TimesTen in-Memory Database

in-memory, relational database management system with persistence and recoverability

1. Website
Pivotal GemFire XD

Low-latency, in-memory, distributed SQL data store. Provides SQL interface to in-memory table data, persistable in HDFS.

1. Website
SAP HANA

is an in-memory, column-oriented, relational database management system

1. Website
Segment SQL

Track your customer data to Amazon Redshift

1. Website
SenseiDB

Open-source, distributed, realtime, semi-structured database. Some Features: Full-text search, Fast realtime updates, Structured and faceted search, BQL: SQL-like query language, Fast key-value lookup, High performance under concurrent heavy update and query volumes, Hadoop integration

1. SenseiDB site
Sky

Sky is an open source database used for flexible, high performance analysis of behavioral data. For certain kinds of data such as clickstream data and log data, it can be several orders of magnitude faster than traditional approaches such as SQL databases or Hadoop.

1. SkyDB site
SymmetricDS

SymmetricDS is open source software for both file and database synchronization with support for multi-master replication, filtered synchronization, and transformation across the network in a heterogeneous environment. It supports multiple subscribers with one direction or bi-directional, asynchronous data replication. It uses web and database technologies to replicate data as a scheduled or near real-time operation. The software was designed to scale for a large number of nodes, work across low-bandwidth connections, and withstand periods of network outage. It works with most operating systems, file systems, and databases, including Oracle, MySQL, MariaDB, PostgreSQL, MS SQL Server (including Azure), IBM DB2, H2, HSQLDB, Derby, Firebird, Interbase, Informix, Greenplum, SQLite (including Android), Sybase ASE, and Sybase ASA (SQL Anywhere) databases.

1. SymmetricDS
Teradata Database

complete relational database management system

1. Website
VoltDB

in-memory NewSQL database

1. Website
Columnar Databases
Amazon RedShift

data warehouse service, based on PostgreSQL

1. Amazon RedShift
Apache Arrow

Powering Columnar In-Memory Analytics

1. Website
C-Store

column oriented DBMS

1. Website
Google BigQuery

framework for interactive analysis, implementation of Dremel

1. Google BigQuery
Google Dremel

scalable, interactive ad-hoc query system for analysis of read-only nested data; the basis of Google BigQuery

1. Dremel Paper
MonetDB

column store database

1. Website
Parquet

columnar storage format for Hadoop.

1. Parquet
Pivotal Greenplum

purpose-built, dedicated analytic data warehouse

1. Website
Vertica

The grid-based, column-oriented, Vertica Analytics Platform is designed to manage large, fast-growing volumes of data and provide very fast query performance when used for data warehouses and other query-intensive applications. The product claims to drastically improve query performance over traditional relational database systems, provide high-availability, and petabyte scalability on commodity enterprise servers.

1. Website
Time-Series Databases
Chronix

fast and efficient time series storage based on Apache Lucene and Apache Solr

1. Website
Cube

uses MongoDB to store time series data

1. Website
Etsy StatsD

simple daemon for easy stats aggregation

1. Website
InfluxDB

InfluxDB is an open source distributed time series database with no external dependencies. It’s useful for recording metrics, events, and performing analytics. It has a built-in HTTP API so you don’t have to write any server-side code to get up and running. InfluxDB is designed to be scalable, simple to install and manage, and fast to get data in and out. It aims to answer queries in real time: every data point is indexed as it comes in and is immediately available in queries.

1. Website
Kairos

Time series data storage in Redis, Mongo, SQL and Cassandra

1. Website
Kairosdb

similar to OpenTSDB but with Cassandra as the storage backend

1. Website
OpenTSDB

OpenTSDB is a distributed, scalable Time Series Database (TSDB) written on top of HBase. OpenTSDB was written to address a common need: store, index and serve metrics collected from computer systems (network gear, operating systems, applications) at a large scale, and make this data easily accessible and graphable.

1. OpenTSDB site
2. Website
Prometheus

an open-source service monitoring system and time series database

1. Website
Square Cube

system for collecting timestamped events and deriving metrics

1. Website
TempoIQ

Cloud-based sensor analytics

1. Website
SQL-like processing
Actian SQL for Hadoop

high performance interactive SQL access to all Hadoop data

1. Website
Adabas

ADABAS was NoSQL from a time when there was no SQL

1. Website
Akiban

Touted as an SQL database with object-structured storage

1. Website
AMPLAB Shark

Shark is a large-scale data warehouse system for Spark designed to be compatible with Apache Hive. It can execute Hive QL queries up to 100 times faster than Hive without any modification to the existing data or queries. Shark supports Hive’s query language, metastore, serialization formats, and user-defined functions, providing seamless integration with existing Hive deployments and a familiar, more powerful option for new ones. Shark is built on top of Spark

1. AMPLAB on GitHub Shark
Apache Drill

Drill is the open source version of Google’s Dremel system which is available as an infrastructure service called Google BigQuery. In recent years open source systems have emerged to address the need for scalable batch processing (Apache Hadoop) and stream processing (Storm, Apache S4). Apache Hadoop, originally inspired by Google’s internal MapReduce system, is used by thousands of organizations processing large-scale datasets. Apache Hadoop is designed to achieve very high throughput, but is not designed to achieve the sub-second latency needed for interactive data analysis and exploration. Drill, inspired by Google’s internal Dremel system, is intended to address this need

1. Apache Drill
Apache HCatalog

HCatalog’s table abstraction presents users with a relational view of data in the Hadoop Distributed File System (HDFS) and ensures that users need not worry about where or in what format their data is stored. Right now HCatalog is part of Hive. Only old versions are separated for download.

1. Apache HCatalog
Apache Hive

Data warehouse infrastructure developed by Facebook for data summarization, query, and analysis. It provides an SQL-like language (not SQL-92 compliant): HiveQL.

1. Apache Hive
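
A hedged sketch of issuing HiveQL over JDBC against HiveServer2 (host, credentials and table names are placeholders; the Hive JDBC driver must be on the classpath):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveExample {
  public static void main(String[] args) throws Exception {
    // HiveServer2 JDBC endpoint; authentication depends on the cluster's configuration
    try (Connection conn = DriverManager.getConnection("jdbc:hive2://hive-host:10000/default", "user", "");
         Statement stmt = conn.createStatement()) {

      stmt.execute("CREATE TABLE IF NOT EXISTS logs (ts STRING, level STRING, msg STRING)");

      // HiveQL looks like SQL but is compiled into distributed jobs under the hood
      try (ResultSet rs = stmt.executeQuery("SELECT level, COUNT(*) AS n FROM logs GROUP BY level")) {
        while (rs.next()) {
          System.out.println(rs.getString("level") + ": " + rs.getLong("n"));
        }
      }
    }
  }
}
```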
Apache Optiq

framework that allows efficient translation of queries involving heterogeneous and federated data

1. Website
Apache Phoenix

Apache Phoenix is a SQL skin over HBase delivered as a client-embedded JDBC driver targeting low latency queries over HBase data. Apache Phoenix takes your SQL query, compiles it into a series of HBase scans, and orchestrates the running of those scans to produce regular JDBC result sets. The table metadata is stored in an HBase table and versioned, such that snapshot queries over prior versions will automatically use the correct schema. Direct use of the HBase API, along with coprocessors and custom filters, results in performance on the order of milliseconds for small queries, or seconds for tens of millions of rows.

1. Apache Phoenix site
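
A hedged sketch of the client-embedded JDBC usage described above (the ZooKeeper quorum in the URL and the table name are placeholders):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class PhoenixExample {
  public static void main(String[] args) throws Exception {
    // The JDBC URL points at the HBase cluster's ZooKeeper quorum
    try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zookeeper-host:2181")) {
      try (Statement stmt = conn.createStatement()) {
        stmt.executeUpdate("CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, name VARCHAR)");
      }
      // Phoenix uses UPSERT rather than INSERT
      try (PreparedStatement upsert = conn.prepareStatement("UPSERT INTO users VALUES (?, ?)")) {
        upsert.setInt(1, 1);
        upsert.setString(2, "alice");
        upsert.executeUpdate();
      }
      conn.commit();  // auto-commit is off by default in Phoenix

      try (Statement stmt = conn.createStatement();
           ResultSet rs = stmt.executeQuery("SELECT id, name FROM users")) {
        while (rs.next()) {
          System.out.println(rs.getInt("id") + " -> " + rs.getString("name"));
        }
      }
    }
  }
}
```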
BlinkDB

massively parallel, approximate query engine

1. BlinkDB
Brytlyt

a fully enabled GPGPU database which allows offloading of database operations to general-purpose processing on graphics processing units

1. Website
Cloudera Impala

The Apache-licensed Impala project brings scalable parallel database technology to Hadoop, enabling users to issue low-latency SQL queries to data stored in HDFS and Apache HBase without requiring data movement or transformation. It is a Google Dremel clone (cf. Google BigQuery).

1. Website
2. Cloudera Impala
Concurrent Lingual

Open source project enabling fast and simple Big Data application development on Apache Hadoop; it delivers ANSI-standard SQL technology to easily build new applications and integrate existing ones onto Hadoop.

1. Cascading Lingual
Datasalt Splout SQL

Splout allows serving an arbitrarily big dataset with high QPS rates and at the same time provides full SQL query syntax.

1. Website
eBay Kylin

Distributed Analytics Engine from eBay Inc. that provides SQL interface and multi-dimensional analysis (OLAP) on Hadoop supporting extremely large datasets

1. Website
Facebook PrestoDB

Facebook has open sourced Presto, a SQL engine it says is on average 10 times faster than Hive for running queries across large data sets stored in Hadoop and elsewhere.

1. Facebook PrestoDB
Hadapt

a native implementation of SQL for the Apache Hadoop open-source project

1. Website
Hekaton

Refers to the lock-free in-memory architecture in SQL Server 2014

1. Website
JethroData

index-based SQL engine for Hadoop

1. Website
Metanautix Quest

data compute engine

1. Website
Pivotal HAWQ

SQL-like data warehouse system for Hadoop

1. Pivotal HAWQ
RainstorDB

database for storing petabyte-scale volumes of structured and semi-structured data

1. Website
Spark Catalyst

Catalyst is a Query Optimization Framework for Spark and Shark

1. Github sub page
SparkSQL

Manipulating Structured Data Using Spark

1. Databricks blog post
Splice Machine

a full-featured SQL-on-Hadoop RDBMS with ACID transactions

1. Website
Stinger

interactive query for Hive

1. Stinger
Tajo

Tajo is a distributed data warehouse system on Hadoop that provides low-latency and scalable ad-hoc queries and ETL on large data sets stored on HDFS and other data sources.

1. Tajo site
Trafodion

enterprise-class SQL-on-HBase solution targeting big data transactional or operational workloads

1. Website
Integrated Development Environments
R-Studio

IDE for R.

1. Website
Data Ingestion
Amazon Kinesis

Real-time processing of streaming data at massive scale

1. Amazon Kinesis
Amazon Snowball

a petabyte-scale data transport solution that uses secure appliances to transfer large amounts of data into and out of AWS

1. Website
AMPLab SampleClean

scalable techniques for data cleaning and statistical inference on dirty data

1. Website
Apache BookKeeper

a distributed logging service called BookKeeper and a distributed publish/subscribe system built on top of BookKeeper called Hedwig

1. Website
Apache Chukwa

Large-scale log aggregator and analytics.

1. Apache Chukwa
Apache Flume

Unstructured data aggregator to HDFS.

1. Apache Flume
Apache Samza

Apache Samza is a distributed stream processing framework. It uses Apache Kafka for messaging, and Apache Hadoop YARN to provide fault tolerance, processor isolation, security, and resource management. Developed at LinkedIn (http://www.linkedin.com/in/jaykreps).

1. Apache Samza
Apache Sqoop

System for bulk data transfer between HDFS and structured datastores such as RDBMSes. Like Flume, but for moving data between HDFS and an RDBMS.

1. Apache Sqoop
Apache UIMA

Unstructured Information Management applications are software systems that analyze large volumes of unstructured information in order to discover knowledge that is relevant to an end user

1. Website
Cloudera Morphlines

framework that helps with ETL into Solr, HBase and HDFS.

1. Website
Facebook Scribe

Real-time log aggregator. It is an Apache Thrift service.

1. Facebook Scribe
Fluentd

tool to collect events and logs

1. Fluentd
Google Photon

geographically distributed system for joining multiple continuously flowing streams of data in real-time with high scalability and low latency

1. Website
Heka

open source stream processing software system.

1. Website
HIHO

This project is a framework for connecting disparate data sources with the Apache Hadoop system, making them interoperable. HIHO connects Hadoop with multiple RDBMS and file systems, so that data can be loaded to Hadoop and unloaded from Hadoop

1. Website
LinkedIn Camus

Kafka to HDFS pipeline. It is a MapReduce job that does distributed data loads out of Kafka.

1. Website
LinkedIn Databus

stream of change capture events for a database

1. LinkedIn Databus
LinkedIn Gobblin

a framework for solving the Big Data ingestion problem

1. Website
LinkedIn Kamikaze

utility package for compressing sorted integer arrays

1. LinkedIn Kamikaze
Linkedin Lumos

bridge from OLTP to OLAP for use on Hadoop

1. Website
LinkedIn White Elephant

log aggregator and dashboard

1. LinkedIn White Elephant
Logstash

a tool for managing events and logs.

1. Website
Netflix Ribbon

an Inter-Process Communication (remote procedure call) library with built-in software load balancers. The primary usage model involves REST calls with various serialization scheme support.

1. Website
Netflix Suro

Suro has its roots in Apache Chukwa, which was initially adopted by Netflix. It is a log aggregator, similar in spirit to Storm and Samza.

1. Website
Pinterest Secor

a service implementing Kafka log persistence

1. Github
Record Breaker

Automatic structure for your text-formatted data

1. Website
Sawmill

extensive log processing and reporting features

1. Website
Stratio Ingestion

Apache Flume on steroids

1. Website
TIBCO Enterprise Message Service

standards-based messaging middleware

1. Website
Twitter Zipkin

distributed tracing system that helps us gather timing data for all the disparate services at Twitter

1. Website
Vibe Data Stream

streaming data collection for real-time Big Data analytics

1. Website
Message-oriented middleware
ActiveMQ

open source messaging and Integration Patterns server

1. Website
Amazon Simple Queue Service

fast, reliable, scalable, fully managed queue service

1. Website
Apache Kafka

Distributed publish-subscribe system for processing large amounts of streaming data. Kafka is a Message Queue developed by LinkedIn that persists messages to disk in a very performant manner. Because messages are persisted, it has the interesting ability for clients to rewind a stream and consume the messages again. Another upside of the disk persistence is that bulk importing the data into HDFS for offline analysis can be done very quickly and efficiently. Storm, developed by BackType (which was acquired by Twitter a year ago), is more about transforming a stream of messages into new streams.

1. Apache Kafka
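
A hedged sketch of a Java producer publishing to a topic (the broker address and topic name are placeholders):

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaExample {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "broker1:9092");
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

    // Messages are appended to a partitioned, persisted log; consumers can replay them later
    try (Producer<String, String> producer = new KafkaProducer<>(props)) {
      for (int i = 0; i < 10; i++) {
        producer.send(new ProducerRecord<>("clickstream", Integer.toString(i), "event-" + i));
      }
    }
  }
}
```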
Apache Qpid

messaging tools that speak AMQP and support many languages and platforms

1. Website
Apcera NATS

an open-source, high-performance, lightweight cloud native messaging system

1. Website
Apollo

ActiveMQ’s next generation of messaging

1. Website
Azure Event Hubs

a highly scalable publish-subscribe event ingestor

1. Website
Beanstalkd

simple, fast work queue

1. Website
Bit.ly NSQ

realtime distributed message processing at scale

1. Website
2. Website
Celery

Distributed Task Queue

1. Website
Crossroads I/O

library for building scalable and high performance distributed applications

1. Website
Darner

simple, lightweight message queue

1. Website
Facebook Iris

a totally ordered queue of messaging updates with separate pointers into the queue indicating the last update sent to your Messenger app and the traditional storage tier

1. Website
Gearman

Job Server

1. Website
Google Cloud Pub/Sub

reliable, many-to-many, asynchronous messaging hosted on Google’s infrastructure

1. Website
Google Pub/Sub

reliable, many-to-many, asynchronous messaging hosted on Google’s infrastructure

1. Website
HornetQ

open source project to build a multi-protocol, embeddable, very high performance, clustered, asynchronous messaging system

1. Website
IronMQ

easy-to-use highly available message queuing service

1. Website
Kestrel

distributed message queue system

1. Kestrel
Marconi

queuing and notification service made by and for OpenStack, but not only for it

1. Website
RabbitMQ

Robust messaging for applications

1. Website
RestMQ

message queue which uses HTTP as transport, JSON to format a minimalist protocol and is organized as REST resources

1. Website
RQ

simple Python library for queueing jobs and processing them in the background with workers

1. Website
Sidekiq

Simple, efficient background processing for Ruby

1. Website
ZeroMQ

The Intelligent Transport Layer

1. Website
Service Programming
Akka Toolkit

Akka is an open-source toolkit and runtime simplifying the construction of concurrent applications on the Java platform.

1. Website
Apache Avro

Apache Avro is a framework for modeling, serializing and making Remote Procedure Calls (RPC). Avro data is described by a schema, and one interesting feature is that the schema is stored in the same file as the data it describes, so files are self-describing. Avro does not require code generation. This framework can compete with other similar tools like: Apache Thrift, Google Protocol Buffers, ZeroC ICE, and so on.

1. Apache Avro
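
A hedged sketch showing how the schema travels inside the data file, using Avro's generic (non-code-generated) Java API; the record schema and file name are illustrative:

```java
import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroExample {
  public static void main(String[] args) throws Exception {
    // Schema defined inline as JSON; in practice it usually lives in a .avsc file
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
      + "{\"name\":\"name\",\"type\":\"string\"},"
      + "{\"name\":\"age\",\"type\":\"int\"}]}");

    GenericRecord user = new GenericData.Record(schema);
    user.put("name", "alice");
    user.put("age", 30);

    // The container file stores the schema alongside the records, so it is self-describing
    File file = new File("users.avro");
    try (DataFileWriter<GenericRecord> writer =
             new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
      writer.create(schema, file);
      writer.append(user);
    }

    try (DataFileReader<GenericRecord> reader =
             new DataFileReader<>(file, new GenericDatumReader<GenericRecord>())) {
      for (GenericRecord rec : reader) {
        System.out.println(rec.get("name") + " is " + rec.get("age"));
      }
    }
  }
}
```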
Apache Curator

Curator is a set of Java libraries that make using Apache ZooKeeper much easier.

1. Website
Apache Karaf

Apache Karaf is an OSGi runtime that runs on top of any OSGi framework and provides you a set of services, a powerful provisioning concept, an extensible shell & more.

1. Website
Apache Thrift

A cross-language RPC framework for service creations. It’s the service base for Facebook technologies (the original Thrift contributor). Thrift provides a framework for developing and accessing remote services. It allows developers to create services that can be consumed by any application that is written in a language that there are Thrift bindings for. Thrift manages serialization of data to and from a service, as well as the protocol that describes a method invocation, response, etc. Instead of writing all the RPC code – you can just get straight to your service logic. Thrift uses TCP and so a given service is bound to a particular port.

1. Apache Thrift
Apache Zookeeper

It’s a coordination service that gives you the tools you need to write correct distributed applications. ZooKeeper was developed at Yahoo! Research. Several Hadoop projects are already using ZooKeeper to coordinate the cluster and provide highly-available distributed services. Perhaps the most famous of those are Apache HBase, Storm and Kafka. ZooKeeper is an application library with two principal implementations of the APIs—Java and C—and a service component implemented in Java that runs on an ensemble of dedicated servers. ZooKeeper is for building distributed systems; it simplifies the development process, making it more agile and enabling more robust implementations. Back in 2006, Google published a paper on “Chubby”, a distributed lock service which gained wide adoption within their data centers. ZooKeeper, not surprisingly, is a close clone of Chubby designed to fulfill many of the same roles for HDFS and other Hadoop infrastructure.

1. Apache Zookeeper
2. Google Chubby paper
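
A hedged sketch of the Java client API (the ensemble address and znode path are placeholders):

```java
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperExample {
  public static void main(String[] args) throws Exception {
    CountDownLatch connected = new CountDownLatch(1);

    // Connect to the ensemble; the watcher is notified of session events
    ZooKeeper zk = new ZooKeeper("zk-host:2181", 3000, event -> {
      if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
        connected.countDown();
      }
    });
    connected.await();

    // Create a persistent znode and read it back
    if (zk.exists("/demo-config", false) == null) {
      zk.create("/demo-config", "v1".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    }
    byte[] data = zk.getData("/demo-config", false, null);
    System.out.println(new String(data));

    zk.close();
  }
}
```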
Google Chubby

a lock service for loosely-coupled distributed systems

1. Paper
Linkedin Norbert

Norbert is a library that provides easy cluster management and workload distribution. With Norbert, you can quickly distribute a simple client/server architecture to create a highly scalable architecture capable of handling heavy traffic. Implemented in Scala, Norbert wraps ZooKeeper, Netty and uses Protocol Buffers for transport to make it easy to build a cluster aware application. A Java API is provided and pluggable load balancing strategies are supported with round robin and consistent hash strategies provided out of the box.

1. LinkedIn Project
2. GitHub source code
MPICH

high performance and widely portable implementation of the Message Passing Interface (MPI) standard

1. Website
OpenMPI

message passing framework

1. OpenMPI
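
Both MPICH and OpenMPI implement the same MPI standard; a minimal sketch through the mpi4py binding (an assumption, not part of either project) looks like this, launched with something like `mpiexec -n 4 python allreduce.py` (file name illustrative).

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()            # this process's id within the communicator
    size = comm.Get_size()            # total number of processes

    total = comm.allreduce(rank, op=MPI.SUM)   # every rank receives the global sum
    if rank == 0:
        print("processes:", size, "sum of ranks:", total)
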
Serf

decentralized solution for service discovery and orchestration

1. Serf
Spotify Luigi

a Python package for building complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, handling failures, command line integration, and much more.

1. Website
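
Since Luigi pipelines are plain Python, a two-task example shows the dependency-resolution idea: CountWords declares that it requires Extract, and Luigi only runs tasks whose outputs are missing. Task and file names are illustrative.

    import luigi

    class Extract(luigi.Task):
        def output(self):
            return luigi.LocalTarget("raw.txt")

        def run(self):
            with self.output().open("w") as f:
                f.write("hello world\nhello luigi\n")

    class CountWords(luigi.Task):
        def requires(self):
            return Extract()                      # dependency resolution happens here

        def output(self):
            return luigi.LocalTarget("counts.txt")

        def run(self):
            with self.input().open() as fin, self.output().open("w") as fout:
                fout.write(str(len(fin.read().split())))

    if __name__ == "__main__":
        luigi.build([CountWords()], local_scheduler=True)
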
Spring XD

Spring XD (Xtreme Data) is an evolution of the Spring Java application development framework to support Big Data applications, by Pivotal. SpringSource was the company created by the founders of the Spring Framework. SpringSource was purchased by VMware, where it was maintained for some time as a separate division within VMware. Later VMware, and its parent company EMC Corporation, formally created a joint venture called Pivotal. Spring XD is more than a development framework library; it is a distributed and extensible system for data ingestion, real-time analytics, batch processing, and data export. It can be considered an alternative to Apache Flume/Sqoop/Oozie in some scenarios. Spring XD is part of Pivotal Spring for Apache Hadoop (SHDP). SHDP, integrated with Spring, Spring Batch and Spring Data, is part of the Spring IO Platform as foundational libraries. Building on top of, and extending, this foundation, the Spring IO platform provides Spring XD as a big data runtime. SHDP aims to simplify the development of Hadoop-based applications by providing a consistent configuration and API across a wide range of Hadoop ecosystem projects such as Pig, Hive, and Cascading, in addition to providing extensions to Spring Batch for orchestrating Hadoop-based workflows.

1. Spring XD on GitHub
Twitter Elephant Bird

Elephant Bird is a project that provides utilities (libraries) for working with LZOP-compressed data. It also provides a container format that supports working with Protocol Buffers and Thrift in MapReduce, Writables, Pig LoadFuncs, Hive SerDes, and HBase miscellanea. This open source library is heavily used at Twitter.

1. Elephant Bird GitHub
Twitter Finagle

Finagle is an asynchronous network stack for the JVM that you can use to build asynchronous Remote Procedure Call (RPC) clients and servers in Java, Scala, or any JVM-hosted language.

1. Website
Scheduling
AirBnB Airflow

Airflow is a system to programmatically author, schedule and monitor data pipelines

1. Website
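
An Airflow pipeline is just a Python module that builds a DAG object; a minimal sketch follows (import paths vary slightly across Airflow versions, and the DAG, task ids and commands are illustrative).

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    dag = DAG("daily_ingest",
              start_date=datetime(2016, 1, 1),
              schedule_interval="@daily")

    extract = BashOperator(task_id="extract",
                           bash_command="echo extracting",
                           dag=dag)
    load = BashOperator(task_id="load",
                        bash_command="echo loading",
                        dag=dag)

    extract.set_downstream(load)   # load runs only after extract succeeds
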
Apache Aurora

is a service scheduler that runs on top of Apache Mesos

1. Apache Incubator
Apache Falcon

Apache™ Falcon is a data management framework for simplifying data lifecycle management and processing pipelines on Apache Hadoop®. It enables users to configure, manage and orchestrate data motion, pipeline processing, disaster recovery, and data retention workflows. Instead of hard-coding complex data lifecycle capabilities, Hadoop applications can now rely on the well-tested Apache Falcon framework for these functions. Falcon’s simplification of data management is quite useful to anyone building apps on Hadoop. Data Management on Hadoop encompasses data motion, process orchestration, lifecycle management, data discovery, etc. among other concerns that are beyond ETL. Falcon is a new data processing and management platform for Hadoop that solves this problem and creates additional opportunities by building on existing components within the Hadoop ecosystem (ex. Apache Oozie, Apache Hadoop DistCp etc.) without reinventing the wheel.

1. Apache Falcon
Apache Oozie

Workflow scheduler system for MapReduce jobs using DAGs (Directed Acyclic Graphs). The Oozie Coordinator can also trigger jobs by time (frequency) and data availability.

1. Apache Oozie
Chronos

distributed and fault-tolerant scheduler

1. Chronos
Linkedin Azkaban

Hadoop workflow management. A batch job scheduler that can be seen as a combination of the cron and make Unix utilities, with a friendly UI.

1. LinkedIn Azkaban
Pinterest Pinball

customizable platform for creating workflow managers

1. Website
Sparrow

Sparrow is a high throughput, low latency, and fault-tolerant distributed cluster scheduler. Sparrow is designed for applications that require resource allocations frequently for very short jobs, such as analytics frameworks. Sparrow schedules from a distributed set of schedulers that maintain no shared state. Instead, to schedule a job, a scheduler obtains instantaneous load information by sending probes to a subset of worker machines. The scheduler places the job’s tasks on the least loaded of the probed workers. This technique allows Sparrow to schedule in milliseconds, two orders of magnitude faster than existing approaches. Sparrow also handles failures: if a scheduler fails, a client simply directs scheduling requests to an alternate scheduler.

1. Github
2. Paper
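
The probe-based placement described above is easy to see in a toy simulation (an illustration of the idea, not Sparrow's code): each task probes a couple of random workers and lands on the least loaded one, with no shared scheduler state.

    import random

    def place_tasks(num_tasks, queue_lengths, probes_per_task=2):
        """Place each task on the least-loaded of a few randomly probed workers."""
        for _ in range(num_tasks):
            probed = random.sample(range(len(queue_lengths)), probes_per_task)
            target = min(probed, key=lambda w: queue_lengths[w])
            queue_lengths[target] += 1   # one more task queued on that worker

    workers = [0] * 100                  # per-worker queue lengths
    place_tasks(500, workers)
    print("max queue:", max(workers), "mean queue:", sum(workers) / len(workers))
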
Machine Learning
Amazon Machine Learning

visualization tools and wizards that guide you through the process of creating machine learning (ML) models without having to learn complex ML algorithms and technology

1. Website
AMPLab Splash

a general framework for parallelizing stochastic learning algorithms on multi-node clusters

1. Website
AMPLab Velox

a data management system for facilitating the next steps in real-world, large-scale analytics pipelines

1. Website
Apache Mahout

Machine learning library and math library, on top of MapReduce.

1. Apache Mahout
Ayasdi Core

tool for topological data analysis

1. Website
brain

Neural networks in JavaScript.

1. Website
Caffe

a deep learning framework made with expression, speed, and modularity in mind. It is developed by the Berkeley Vision and Learning Center

1. Website
Cloudera Oryx

The Oryx open source project provides simple, real-time large-scale machine learning / predictive analytics infrastructure. It implements a few classes of algorithm commonly used in business applications: collaborative filtering / recommendation, classification / regression, and clustering.

1. Oryx at GitHub
2. Cloudera forum for Machine Learning
Concurrent Pattern

Machine Learning for Cascading on Apache Hadoop through an API, and standards based PMML

1. Cascading Pattern
convnetjs

Deep Learning in Javascript. Train Convolutional Neural Networks (or ordinary ones) in your browser.

1. Website
cuDNN

GPU-accelerated library of primitives for deep neural networks

1. Website
Decider

Flexible and Extensible Machine Learning in Ruby.

1. Website
DeepCL

OpenCL library to train deep convolutional neural networks

1. Website
etcML

text classification with machine learning

1.
Etsy Conjecture

Conjecture is a framework for building machine learning models in Hadoop using the Scalding DSL. The goal of this project is to enable the development of statistical models as viable components in a wide range of product settings. Applications include classification and categorization, recommender systems, ranking, filtering, and regression (predicting real-valued numbers). Conjecture has been designed with a primary emphasis on flexibility and can handle a wide variety of inputs. Integration with Hadoop and Scalding enables seamless handling of extremely large data volumes and integration with established ETL processes. Predicted labels can either be consumed directly by the web stack using the dataset loader, or models can be deployed and consumed by live web code. Currently, binary classification (assigning one of two possible labels to input data points) is the most mature component of the Conjecture package.

1. Github
Facebook DeepText

a deep learning-based text understanding engine that can understand with near-human accuracy the textual content of several thousand posts per second, spanning more than 20 languages

1. Website
Facebook FBLearner Flow

provides innovative functionality, like automatic generation of UI experiences from pipeline definitions and automatic parallelization of Python code using futures

1. Website
fbcunn

Deep Learning CUDA Extensions from Facebook AI Research

1. Website
Google DistBelief

software framework that can utilize computing clusters with thousands of machines to train large models

1. Website
Google Sibyl

System for Large Scale Machine Learning at Google

1. Website
2. Website
3. Website
Google TensorFlow

an Open Source Software Library for Machine Intelligence

1. Website
H2O

statistical, machine learning and math runtime for Hadoop

1. H2O
IBM Watson

cognitive computing system

1. Website
KeystoneML

Simplifying robust end-to-end machine learning on Apache Spark

1. Website
LinkedIn FeatureFu

contains a collection of libraries/tools for advanced feature engineering, to derive features on top of other features or convert a lightweight model into a feature

1. Website
LinkedIn ml-ease

ADMM based large scale logistic regression

1. Website
Microsoft Azure Machine Learning

is built on the machine learning capabilities already available in several Microsoft products, including Xbox and Bing, and uses predefined templates and workflows

1. Website
Microsoft CNTK

Computational Network Toolkit

1. Website
MLbase

distributed machine learning libraries for the BDAS stack

1. MLbase
MLPNeuralNet

Fast multilayer perceptron neural network library for iOS and Mac OS X.

1. Website
Neon

a highly configurable deep learning framework

1. Website
nupic

Numenta Platform for Intelligent Computing: a brain-inspired machine intelligence platform, and biologically accurate neural network based on cortical learning algorithms.

1. Website
OpenAI Gym

a toolkit for developing and comparing reinforcement learning algorithms

1. Website
PredictionIO

machine learning server built on Hadoop, Mahout and Cascading

1. PredictionIO
scikit-learn

scikit-learn: machine learning in Python.

1. Website
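
A minimal scikit-learn example, fitting a random forest on the bundled iris dataset (scored on the training data purely for brevity):

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    iris = load_iris()
    clf = RandomForestClassifier(n_estimators=50)
    clf.fit(iris.data, iris.target)

    print(clf.predict(iris.data[:3]))         # predicted classes for three samples
    print(clf.score(iris.data, iris.target))  # training accuracy
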
Seldon

an open source predictive analytics platform based upon Spark, Kafka and Hadoop

1. Website
Spark MLlib

a Spark implementation of some common machine learning (ML) functionality

1. Spark Documentation
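
A sketch of the RDD-based MLlib API from the Spark 1.x era; the tiny dataset and parameter choices are illustrative.

    from pyspark import SparkContext
    from pyspark.mllib.classification import LogisticRegressionWithSGD
    from pyspark.mllib.regression import LabeledPoint

    sc = SparkContext(appName="mllib-sketch")

    points = sc.parallelize([
        LabeledPoint(0.0, [0.0, 1.0]),
        LabeledPoint(1.0, [1.0, 0.0]),
    ])
    model = LogisticRegressionWithSGD.train(points, iterations=10)
    print(model.predict([1.0, 0.0]))   # predicted label for a new feature vector

    sc.stop()
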
Sparkling Water

combine H2O’s Machine Learning capabilities with the power of the Spark platform

1. Website
2. Website
SparkNet

Distributed Neural Networks for Spark

1. Website
Theano

Python package for deep learning that can utilize NVIDIA’s CUDA toolkit to run on the GPU

1. Website
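
Theano works by compiling symbolic expressions (optionally into CUDA kernels); a tiny example that also derives a gradient symbolically:

    import theano
    import theano.tensor as T

    x = T.dvector("x")
    y = (x ** 2).sum()          # symbolic expression
    grad = T.grad(y, x)         # symbolic gradient dy/dx = 2x

    f = theano.function([x], [y, grad])   # compiled for CPU or GPU
    print(f([1.0, 2.0, 3.0]))             # [14.0, [2. 4. 6.]]
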
Thunder

Large-scale analysis of neural data

1. Website
Vahara

Machine learning and natural language processing with Apache Pig

1. Website
Velox

a system for serving machine learning predictions

1. Website
Viv

global platform that enables developers to plug into and create an intelligent, conversational interface to anything

1. Website
Vowpal Wabbit

learning system sponsored by Microsoft and Yahoo!

1. Vowpal Wabbit
WEKA

Weka (Waikato Environment for Knowledge Analysis) is a popular suite of machine learning software written in Java, developed at the University of Waikato, New Zealand. Weka is free software available under the GNU General Public License.

1. Website
Wit

Natural Language for the Internet of Things

1. Website
Wolfram Alpha

computational knowledge engine

1. Website
YHat ScienceOps

platform for deploying, managing, and scaling predictive models in production applications

1. Website
Benchmarking
Apache Hadoop Benchmarking

There are two main JAR files in Apache Hadoop for benchmarking. These JARs contain micro-benchmarks for testing particular parts of the infrastructure; for instance, TestDFSIO analyzes the disk system, TeraSort evaluates MapReduce tasks, WordCount measures cluster performance, etc. The micro-benchmarks are packaged in the tests and examples JAR files, and you can get a list of them, with descriptions, by invoking the JAR file with no arguments. As of the Apache Hadoop 2.2.0 stable release, the following JAR files are available for tests, examples and benchmarking: hadoop-mapreduce-examples-2.2.0.jar and hadoop-mapreduce-client-jobclient-2.2.0-tests.jar.

1. MAPREDUCE-3561 umbrella ticket to track all the issues related to performance
Berkeley SWIM Benchmark

The SWIM benchmark (Statistical Workload Injector for MapReduce) is a benchmark representing a real-world big data workload, developed by the University of California at Berkeley in close cooperation with Facebook. This test provides rigorous measurements of the performance of MapReduce systems on real industry workloads.

1. GitHub SWIM
Big-Bench

Big Bench Workload Development

1. Website
Hive-benchmarks

some benchmarking queries for Apache Hive

1. Website
Hive-testbench

Testbench for experimenting with Apache Hive at any data scale.

1. Website
Intel HiBench

HiBench is a Hadoop benchmark suite.

1. Website
Mesosaurus

Mesos task load simulator framework for (cluster and Mesos) performance analysis

1. Website
Netflix Inviso

performance focused Big Data tool

1. Website
PUMA Benchmarking

Benchmark suite which represents a broad range of MapReduce applications exhibiting application characteristics with high/low computation and high/low shuffle volumes. There are a total of 13 benchmarks, out of which Tera-Sort, Word-Count, and Grep are from Hadoop distribution. The rest of the benchmarks were developed in-house and are currently not part of the Hadoop distribution. The three benchmarks from Hadoop distribution are also slightly modified to take number of reduce tasks as input from the user and generate final time completion statistics of jobs.

1. MAPREDUCE-5116
2. Faraz Ahmad researcher
3. PUMA Docs
Yahoo Gridmix3

Hadoop cluster benchmarking from the Yahoo engineering team.

1. Website
Security
Apache Knox Gateway

System that provides a single point of secure access for Apache Hadoop clusters. The goal is to simplify Hadoop security for both users (i.e. who access the cluster data and execute jobs) and operators (i.e. who control access and manage the cluster). The Gateway runs as a server (or cluster of servers) that serves one or more Hadoop clusters.

1. Website
Apache Ranger

framework to enable, monitor and manage comprehensive data security across the Hadoop platform (formerly called Apache Argus)

1. Website
Apache Sentry

Sentry is the next step in enterprise-grade big data security and delivers fine-grained authorization to data stored in Apache Hadoop™. An independent security module that integrates with open source SQL query engines Apache Hive and Cloudera Impala, Sentry delivers advanced authorization controls to enable multi-user applications and cross-functional processes for enterprise data sets. Sentry was a Cloudera development.

1. Website
PacketPig

Open Source Big Data Security Analytics

1. Website
Voltage SecureData

data protection framework

1. Website
System Deployment
Ankush

A big data cluster management tool that creates and manages clusters of different technologies.

1. Website
Apache Ambari

Intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs. Apache Ambari was donated by the Hortonworks team to the ASF. It’s a powerful and nice interface for Hadoop and other typical applications from the Hadoop ecosystem. Apache Ambari is under heavy development, and it will incorporate new features in the near future. For example, Ambari is able to deploy a complete Hadoop system from scratch, but it is not possible to use this GUI with a Hadoop system that is already running. The ability to provision the operating system would be a good addition; however, it is probably not on the roadmap.

1. Apache Ambari
Apache Bigtop

Bigtop was originally developed and released as an open source packaging infrastructure by Cloudera. Bigtop is used by some vendors to build their own distributions based on Apache Hadoop (CDH, Pivotal HD, Intel’s distribution); however, Apache Bigtop does many more tasks, like continuous integration testing (with Jenkins, Maven, …), and is useful for packaging (RPM and DEB), deployment with Puppet, and so on. Apache Bigtop can be considered a community effort with one main focus: packaging all the bits of the Hadoop ecosystem as a whole, rather than as individual projects.

1. Apache Bigtop.
Apache Helix

Apache Helix is a generic cluster management framework used for the automatic management of partitioned, replicated and distributed resources hosted on a cluster of nodes. Originally developed by LinkedIn, it is now an incubator project at Apache. Helix is developed on top of ZooKeeper for coordination tasks.

1. Apache Helix
Apache Mesos

Mesos is a cluster manager that provides resource sharing and isolation across cluster applications, much as HTCondor, SGE or Torque can. However, Mesos has a Hadoop-centred design.

1. Apache Mesos
Apache Slider

Slider is a YARN application to deploy existing distributed applications on YARN, monitor them and make them larger or smaller as desired, even while the cluster is running.

1. GitHub page
Apache Whirr

Apache Whirr is a set of libraries for running cloud services. It allows you to use simple commands to boot clusters of distributed systems for testing and experimentation. Apache Whirr makes booting clusters easy.

1. Apache Whirr
Apache YARN

Apache Hadoop YARN is a sub-project of Hadoop at the Apache Software Foundation introduced in Hadoop 2.0 that separates the resource management and processing components. YARN was born of a need to enable a broader array of interaction patterns for data stored in HDFS beyond MapReduce. The YARN-based architecture of Hadoop 2.0 provides a more general processing platform that is not constrained to MapReduce.

1. Apache YARN
Brooklyn

Brooklyn is a library that simplifies application deployment and management. For deployment, it is designed to tie in with other tools, giving single-click deploy and adding the concepts of manageable clusters and fabrics. Many common software entities are available out of the box. It integrates with Apache Whirr – and thereby Chef and Puppet – to deploy well-known services such as Hadoop and Elasticsearch (or you can use POBS, plain-old-bash-scripts), and it can use PaaS’s such as OpenShift, alongside self-built clusters, for maximum flexibility.

1. Github
Buildoop

Buildoop is an open source project licensed under Apache License 2.0, based on the Apache Bigtop idea. Buildoop is a collaboration project that provides templates and tools to help you create custom Linux-based systems based on the Hadoop ecosystem. The project is built from scratch using the Groovy language, and is not based on a mixture of tools as Bigtop is (Makefile, Gradle, Groovy, Maven), so it is probably easier to program with than Bigtop; the design is focused on the basic ideas behind Buildroot and the Yocto Project. The project is in the early stages of development right now.

1. Buildoop
Cloudera Director

a comprehensive data management platform with the flexibility and power to evolve with your business

1. Website
Cloudera HUE

Web application for interacting with Apache Hadoop.

1. Website
CloudPhysics

collect operational metadata from your virtualized infrastructure, then correlate and analyze it to expose operational hazards and waste that pose a threat to your datacenter performance, efficiency and uptime

1. Website
Deimos

Mesos containerizer hooks for Docker

1. Website
Develoop

tool for provisioning, managing and monitoring Apache Hadoop

1. Website
Etsy Sahale

Visualizing Cascading Workflows at Etsy

1. Website
Facebook Autoscale

the load balancer will concentrate workload on a server until it has at least a medium-level workload

1. Website
Facebook Prism

multi-datacenter replication system

1. Website
Ganglia Monitoring System

scalable distributed monitoring system for high-performance computing systems such as clusters and Grids

1. Website
Genie

Genie provides REST-ful APIs to run Hadoop, Hive and Pig jobs, and to manage multiple Hadoop resources and perform job submissions across them.

1. Website
Google Borg

job scheduling and monitoring system

1. Wired article
Google Omega

job scheduling and monitoring system

1. Talk
Hannibal

Hannibal is a tool to help monitor and maintain HBase clusters that are configured for manual splitting.

1. Website
Hortonworks HOYA

HOYA is defined as “running HBase On YARN”. Hoya is a Java tool and is currently CLI driven. It takes in a cluster specification – in terms of the number of region servers, the location of HBASE_HOME, the ZooKeeper quorum hosts, the configuration that the new HBase cluster instance should use, and so on.

1. Hortonworks Blog
Jumbune

Jumbune is an open-source product built for analyzing Hadoop cluster and MapReduce jobs.

1. Website
2. Github
Marathon

Marathon is a Mesos framework for long-running services. Given that you have Mesos running as the kernel for your datacenter, Marathon is the init or upstart daemon.

1. Website
Minotaur

scripts/recipes/configs to spin up VPC-based infrastructure in AWS from scratch and deploy labs to it

1. Website
Myriad

a mesos framework designed for scaling YARN clusters on Mesos. Myriad can expand or shrink one or more YARN clusters in response to events as per configured rules and policies.

1. Website
Netflix SimianArmy

a suite of tools for keeping your cloud operating in top form

1. Website
Netflix Eureka

AWS Service registry for resilient mid-tier load balancing and failover

1. Website
Netflix Hystrix

a latency and fault tolerance library designed to isolate points of access to remote systems, services and 3rd party libraries, stop cascading failure and enable resilience in complex distributed systems where failure is inevitable

1. Website
Scaling Data

tracing data center problems to root cause, predicting capacity issues, identifying emerging failures and highlighting latent threats

1. Website
Stratio Manager

install, manage and monitor all the technology stack related to the Stratio Platform

1. Website
Tumblr Collins

Infrastructure management for engineers

1. Website
Tumblr Genesis

a tool for data center automation

1. Website
Container Manager
Amazon EC2 Container Service

a highly scalable, high performance container management service that supports Docker containers

1. Website
CoreOS Fleet

cluster management tool from CoreOS

1. Website
Docker

an open platform for developers and sysadmins to build, ship, and run distributed applications

1. Website
Docker Swarm

native clustering for Docker

1. Website
Fig

fast, isolated development environments using Docker

1. Website
Google Container Engine

Run Docker containers on Google Cloud Platform, powered by Kubernetes

1. Website
HashiCorp Nomad

a Distributed, Highly Available, Datacenter-Aware Scheduler

1. Website
Kubernetes

open source implementation of container cluster management

1. Website
Pumba

Chaos testing tool for Docker

1. Website
Rocket

an alternative to the Docker runtime, designed for server environments with the most rigorous security and production requirements

1. Website
Applications
Adobe Spindle

Next-generation web analytics processing with Scala, Spark, and Parquet

1. Website
Apache Kiji

Build Real-time Big Data Applications on Apache HBase.

1. Website
Apache Nutch

Highly extensible and scalable open source web crawler software project. A search engine based on Lucene: A Web crawler is an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing. Web crawlers can copy all the pages they visit for later processing by a search engine that indexes the downloaded pages so that users can search them much more quickly.

1. Website
Apache OODT

OODT was originally developed at NASA Jet Propulsion Laboratory to support capturing, processing and sharing of data for NASA’s scientific archives

1. Website
Apache Tika

Toolkit that detects and extracts metadata and structured text content from various documents using existing parser libraries.

1. Apache Tika
Domino

Run, scale, share, and deploy models without any infrastructure.

1. Website
Eclipse BIRT

BIRT is an open source Eclipse-based reporting system that integrates with your Java/Java EE application to produce compelling reports.

1. Website
Eventhub

open source event analytics platform.

1. Website
HIPI Library

HIPI is a library for Hadoop’s MapReduce framework that provides an API for performing image processing tasks in a distributed computing environment.

1. Website
Hunk

Splunk analytics for Hadoop

1. Hunk
MADlib

The MADlib project leverages the data-processing capabilities of an RDBMS to analyze data. The aim of this project is the integration of statistical data analysis into databases. The MADlib project is self-described as the Big Data Machine Learning in SQL for Data Scientists. The MADlib software project began as a collaboration between researchers at UC Berkeley and engineers and data scientists at EMC/Greenplum (now Pivotal).

1. MADlib Community
PivotalR

PivotalR is a package that enables users of R, the most popular open source statistical programming language and environment to interact with the Pivotal (Greenplum) Database as well as Pivotal HD / HAWQ and the open-source database PostgreSQL for Big Data analytics. R is a programming language and data analysis software: you do data analysis in R by writing scripts and functions in the R programming language. R is a complete, interactive, object-oriented language: designed by statisticians, for statisticians. The language provides objects, operators and functions that make the process of exploring, modeling, and visualizing data a natural one.

1. Website
Qubole

auto-scaling Hadoop cluster, built-in data connectors.

1. Website
Sense

Cloud Platform for Data Science and Big Data Analytics

1. Website
Snowplow

enterprise-strength web and event analytics, powered by Hadoop, Kinesis, Redshift and Postgres.

1. Website
SparkR

R frontend for Spark

1. AMPLab extras
Splunk

analyzer for machine-generated data

1. Splunk
Talend

Talend is an open source software vendor that provides data integration, data management, enterprise application integration and big data software and solutions.

1. Website
Search engine and framework
Algolia

Hosted Search API that delivers instant and relevant results from the first keystroke

1. Website
Apache Blur

a search engine capable of querying massive amounts of structured data at incredible speeds

1. Website
Apache Lucene

Search engine library

1. Apache Lucene
Apache Solr

Search platform for Apache Lucene

1. Apache Solr
ElasticSearch

Search and analytics engine based on Apache Lucene

1. ElasticSearch
Elasticsearch Hadoop

Elasticsearch real-time search and analytics natively integrated with Hadoop. Supports Map/Reduce, Cascading, Apache Hive and Apache Pig.

1. Website
Enigma.io

Freemium robust web application for exploring, filtering, analyzing, searching and exporting massive datasets scraped from across the Web

1. Website
Facebook Unicorn

social graph search platform

1. Website
Google Caffeine

continuous indexing system

1. Google blog post
Google Percolator

continuous indexing system

1. Paper
TeraGoogle

large search index

1.
Haeinsa

Haeinsa is a linearly scalable multi-row, multi-table transaction library for HBase. Use Haeinsa if you need strong ACID semantics on your HBase cluster. It is based on the Google Percolator concept.

1. Website
HBase Coprocessor

implementation of Percolator, part of HBase

1. HBase Coprocessor
hIndex

Secondary Index for HBase

1. Website
SF1R Search Engine

distributed high performance massive data engine for enterprise/vertical search

1. SF1R
Lily HBase Indexer

quickly and easily search for any content stored in HBase

1. Website
LinkedIn Bobo

is a Faceted Search implementation written purely in Java, an extension to Apache Lucene.

1. Github Page
LinkedIn Cleo

Cleo is a flexible software library for enabling rapid development of partial, out-of-order and real-time typeahead search. It is suitable for data sets of varying sizes and types. Cleo has been used extensively to power LinkedIn typeahead search covering professional network connections, companies, groups, questions, skills and other site features.

1. Github
LinkedIn Galene

search architecture at LinkedIn

1. Blog post on LinkedIn engineer
LinkedIn Zoie

Zoie is a realtime search/indexing system written in Java.

1. Github
Sphinx Search Server

Sphinx lets you either batch index and search data stored in an SQL database, NoSQL storage, or just files quickly and easily — or index and search data on the fly, working with Sphinx pretty much as with a database server.

1. Sphinx
MySQL forks and evolutions
Amazon Aurora

a MySQL-compatible, relational database engine that combines the speed and availability of high-end commercial databases with the simplicity and cost-effectiveness of open source databases

1. Website
Amazon RDS

MySQL databases in Amazon’s cloud

1. Amazon RDS
BigObject

Real-time Computing Engine Designed for Big Data

1. Website
Drizzle

Drizzle is a re-designed version of the MySQL v6.0 codebase and is designed around a central concept of having a microkernel architecture. Features such as the query cache and authentication system are now plugins to the database, which follow the general theme of “pluggable storage engines” that were introduced in MySQL 5.1. It supports PAM, LDAP, and HTTP AUTH for authentication via the plugins it ships with. Via its plugin system it currently supports logging to files, syslog, and remote services such as RabbitMQ and Gearman. Drizzle is an ACID-compliant relational database that supports transactions via an MVCC design.

1. Website
Galera Cluster

a synchronous multi-master cluster for MySQL, Percona and MariaDB

1. Galera Cluster Homepage
Google Cloud SQL

MySQL databases in Google’s cloud

1. Google Cloud SQL
HiveDB

an open source framework for horizontally partitioning MySQL systems

1. Website
MariaDB

enhanced, drop-in replacement for MySQL

1. MariaDB
MySQL Cluster

MySQL implementation using NDB Cluster storage engine providing shared-nothing clustering and auto-sharding

1. MySQL Cluster
Percona Server

enhanced, drop-in replacement for MySQL

1. Percona Server
ProxySQL

High Performance Proxy for MySQL

1. ProxySQL
TiDB

a distributed SQL database inspired by the design of Google F1

1. Website
TokuDB

TokuDB is a storage engine for MySQL and MariaDB that is specifically designed for high performance on write-intensive workloads. It achieves this via Fractal Tree indexing. TokuDB is a scalable, ACID and MVCC compliant storage engine. TokuDB is one of the technologies that enable Big Data in MySQL.

1. Website
WebScaleSQL

is a collaboration among engineers from several companies that face similar challenges in running MySQL at scale, and seek greater performance from a database technology tailored for their needs.

1. Website
Youtube Vitess

provides servers and tools which facilitate scaling of MySQL databases for large scale web services

1. Website
PostgreSQL forks and evolutions
HadoopDB

hybrid of MapReduce and DBMS

1. HadoopDB
IBM Netezza

high-performance data warehouse appliances

1. Website
Postgres-XL

Scalable Open Source PostgreSQL-based Database Cluster

1. Website
RecDB

Open Source Recommendation Engine Built Entirely Inside PostgreSQL

1. Website
Stado

open source MPP database system solely targeted at data warehousing and data mart applications

1. Website
Yahoo Everest

multi-peta-byte database / MPP derived from PostgreSQL

1. Website
Memcached forks and evolutions
Box Tron

proxy to memcached servers

1. Website
Facebook McDipper

key/value cache for flash storage

1. Facebook McDipper
Facebook Mcrouter

a memcached protocol router for scaling memcached deployments

1. Website
2. Facebook Note
Facebook Memcached

fork of Memcache

1. Facebook Memcached
Twemproxy

A fast, light-weight proxy for memcached and redis

1. Github
Twitter Fatcache

key/value cache for flash storage

1. Twitter Fatcache
Twitter Twemcache

fork of Memcache

1. Twitter Twemcache
Embedded Databases
Actian PSQL

ACID-compliant DBMS developed by Pervasive Software, optimized for embedding in applications

1. Website
BerkeleyDB

a software library that provides a high-performance embedded database for key/value data

1. Oracle website
eXtreme DB

in-memory database combines exceptional performance, reliability and developer efficiency in a proven real-time embedded database engine

1. Website
FairCom c-treeACE

a cross-platform database engine

1. Website
Google Firebase

a powerful API to store and sync data in realtime

1. Website
HamsterDB

transactional key-value database

1. Website
HanoiDB

HanoiDB implements an indexed, key/value storage engine. The primary index is a log-structured merge tree (LSM-BTree) implemented using ‘doubling sizes’ persistent ordered sets of key/value pairs, similar in some regards to LevelDB. HanoiDB includes a visualizer which, when used to watch a living database, resembles the ‘Towers of Hanoi’ puzzle game, which inspired the name of this database.

1. Github
LevelDB

a fast key-value storage library written at Google that provides an ordered mapping from string keys to string values.

1. Google code website
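
The key/value API is tiny; this sketch uses plyvel, a third-party Python binding (an assumption, the native API is C++), and an illustrative database path.

    import plyvel

    db = plyvel.DB("/tmp/demo-leveldb", create_if_missing=True)
    db.put(b"user:1", b"ada")
    db.put(b"user:2", b"grace")

    print(db.get(b"user:1"))              # b'ada'
    for key, value in db.iterator():      # keys come back in sorted order
        print(key, value)
    db.close()
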
LMDB

ultra-fast, ultra-compact key-value embedded data store developed by Symas

1. Symas website
RocksDB

RocksDB is an embeddable persistent key-value store for fast storage. RocksDB can also be the foundation for a client-server database but our current focus is on embedded workloads.

1. RocksDB site
TokioCabinet

a library of routines for managing a database

1. Website
UnQLite

an in-process software library which implements a self-contained, serverless, zero-configuration, transactional NoSQL database engine

1. Website
Business Intelligence
ActivePivot

Java In-Memory OLAP cube stored in columns, with clearly decoupled pre/post processing

1. Website
Adatao

business intelligence and data science platform

1. Website
Amazon QuickSight

Business Intelligence for Big Data

1. Website
Apama analytics

platform for streaming analytics and intelligent automated action

1. Website
Atigeo xPatterns

data analytics platform

1. Website
BIME Analytics

business intelligence platform in the cloud

1. Website
Chartio

lean business intelligence platform to visualize and explore your data.

1. Website
Datapine

self-service business intelligence tool in the cloud

1. Website
Jaspersoft

powerful business intelligence suite.

1. Website
Jedox Palo

Palo Suite combines all core applications — OLAP Server, Palo Web, Palo ETL Server and Palo for Excel — into one comprehensive and customisable Business Intelligence platform. The platform is completely based on Open Source products representing a high-end Business Intelligence solution which is available entirely free of any license fees.

1. Website
Lavastorm Analytics

used for audit analytics, revenue assurance, fraud management, and customer experience management

1. Website
LinkedIn GoSpeed

provides RUM data processing, visualization, monitoring, and analysis of data on a daily, hourly, or near real-time basis

1. Website
Map-D

GPU in-memory database, big data analysis and visualization platform

1. Website
Microsoft

business intelligence software and platform.

1. Website
Microstrategy

software platforms for business intelligence, mobile intelligence, and network applications.

1. Website
Pentaho

business intelligence platform.

1. Website
Qlik

business intelligence and analytics platform.

1. Website
SpagoBI

SpagoBI is an Open Source Business Intelligence suite, belonging to the free/open source SpagoWorld initiative, founded and supported by Engineering Group. It offers a large range of analytical functions, a highly functional semantic layer often absent in other open source platforms and projects, and a respectable set of advanced data visualization features including geospatial analytics

1. Website
Spotfire

business intelligence platform

1. Website
Stratio Explorer

an interactive Web interpreter for Apache Crossdata, Stratio Ingestion, Stratio Decision, Markdown, Apache Spark, Apache Spark-SQL and command shell

1. Website
Tableau

business intelligence platform.

1. Website
Teradata Aster

Big Data Analytics

1. Website
Tessera

Environment for Deep Analysis of Large Complex Data

1. Website
Zeppelin

open source data analysis environment on top of Hadoop.

1. Website
Zoomdata

Big Data Analytics

1. Website
Data Analysis
Apache Zeppelin

a web-based notebook that enables interactive data analytics

1. Website
Datameer

data analytics application for Hadoop combines self-service data integration, analytics and visualization

1. Website
Ibis

Python big data analysis framework for high performance at Hadoop-scale, with first-class integration with Impala

1. Website
LinkedIn Pinot

a distributed system that supports columnar indexes with the ability to add new types of indexes

1. Website
Microsoft Cortana Analytics

a fully managed big data and advanced analytics suite that enables you to transform your data into intelligent action.

1. Website
Myria

scalable Analytics-as-a-Service platform based on relational algebra

1. Website
Periscope

plugs directly into your databases and lets you run, save, and share analyses over billions of data rows in seconds

1. Website
Pinalytics

Pinterest’s data analytics engine

1. Website
Shiny

web application framework for R

1. Website
Stratio Sparkta

real time monitoring

1. Website
Tamr

standalone tool to catalog all of your enterprise metadata

1. Website
Zaloni Bedrock

fully integrated Hadoop data management platform

1. Website
Zaloni Mica

self-service data discovery, curation, and governance

1. Website
Zillabyte

an API for distributed data computation. Scale with your data.

1. Website
Data Warehouse
Google Mesa

highly scalable analytic data warehousing system

1. Website
IBM BigInsights

data processing, warehousing and analytics

1. Website
IBM dashDB

Data Warehousing and Analysis Needs, all in the Cloud

1. Website
Microsoft Azure SQL Data Warehouse

gives businesses access to an elastic, petabyte-scale data-warehouse-as-a-service offering that can scale according to their needs

1. Website
Microsoft Cosmos

Microsoft’s internal BigData analysis platform

1. Website
Data Visualization
Arbor

graph visualization library using web workers and jQuery.

1. Website
C3

D3-based reusable chart library

1. Website
CartoDB

open-source or freemium hosting for geospatial databases with powerful front-end editing capabilities and a robust API

1. Website
Chart.js

open source HTML5 Charts visualizations.

1. Website
Chartist.js

another open source HTML5 Charts visualization

1. Website
Crossfilter

JavaScript library for exploring large multivariate datasets in the browser. Works well with dc.js and d3.js

1. Website
Cubism

JavaScript library for time series visualization.

1. Website
Cytoscape

open source software platform for visualizing complex networks and integrating these with any type of attribute data

1. Website
2. Website
D3

JavaScript library for manipulating documents based on data.

1. Website
DC.js

Dimensional charting built to work natively with crossfilter rendered using d3.js. Excellent for connecting charts/additional metadata to hover events in D3

1. Website
Envisionjs

dynamic HTML5 visualization.

1. Website
FnordMetric ChartSQL

allows you to write SQL queries that return charts instead of tables.

1. FnordMetric ChartSQL
Freeboard

open source real-time dashboard builder for IOT and other web mashups.

1. Website
Gephi

An award-winning open-source platform for visualizing and manipulating large graphs and network connections. It’s like Photoshop, but for graphs. Available for Windows and Mac OS X.

1. Website
Google Charts

simple charting API.

1. Website
Grafana

open source, feature rich metrics dashboard and graph editor for Graphite, InfluxDB & OpenTSDB

1. Website
Graphistry

runs on GPUs and turns static designs into interactive tools, using client/cloud GPU infrastructure and GPU-accelerated languages like Superconductor

1. Website
Graphite

scalable Realtime Graphing.

1. Website
Highcharts

simple and flexible charting API.

1. Website
IPython

provides a rich architecture for interactive computing

1. Website
Keylines

toolkit for visualizing the networks in your data

1. Website
Kibana

visualize logs and time-stamped data

1. Website
Matplotlib

plotting with Python.

1. Website
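
A minimal matplotlib example (rendered off-screen so it also works on a headless cluster node; the output file name is illustrative):

    import matplotlib
    matplotlib.use("Agg")            # no display needed
    import matplotlib.pyplot as plt

    xs = list(range(10))
    ys = [x * x for x in xs]

    plt.plot(xs, ys, marker="o", label="x^2")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.legend()
    plt.savefig("quadratic.png")
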
Microsoft SandDance

visually explore data sets to find stories and extract insights

1. Website
NVD3

chart components for d3.js.

1. Website
Peity

Progressive SVG bar, line and pie charts.

1. Website
Plot.ly

Easy-to-use web service that allows for rapid creation of complex charts, from heatmaps to histograms. Upload data to create and style charts with Plotly’s online spreadsheet. Fork others’ plots.

1. Website
Recline

simple but powerful library for building data applications in pure Javascript and HTML.

1. Website
Redash

open-source platform to query and visualize data.

1. Website
Sigma.js

JavaScript library dedicated to graph drawing.

1. Website
Square Cubism.js

a D3 plugin for visualizing time series. Use Cubism to construct better realtime dashboards, pulling data from Graphite, Cube and other sources

1. Website
Stratio Viewer

dashboarding tool

1. Website
Vega

a visualization grammar.

1. Website
Internet of Things
2lemetry

Platform for Internet of things

1. Website
Evrything

Making products smart

1. Website
ThingWorx

Rapid development and connection of intelligent systems

1. Website

Papers

1997
1997 Application-Controlled Demand Paging for Out-of-Core Visualization
1999
1999 Pasting Small Votes for Classification in Large Databases and On-Line
1999 The PageRank Citation Ranking: Bringing Order to the Web
2001
2001 Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications
2001 Paxos Made Simple
2001 Random Forests
2002
2002 Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services
2003
2003 Interpreting the Data: Parallel Analysis with Sawzall
2003 The Google File System
2004
2004 Cheap Paxos
2004 MapReduce: Simplified Data Processing on Large Clusters
2005
2005 Fast Paxos
2006
2006 Bigtable: A Distributed Storage System for Structured Data
2006 Ceph: A Scalable, High-Performance Distributed File System
2006 Map-Reduce for Machine Learning on Multicore
2006 The Chubby lock service for loosely-coupled distributed systems
2007
2007 Architecture of a Database System
2007 Consistent Streaming Through Time: A Vision for Event Stream Processing
2007 Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks
2007 Dynamo: Amazon's Highly Available Key-value Store
2007 Labeled Faces in the Wild: A Database for Studying Face Recognition in Unconstrained Environments
2007 Life beyond Distributed Transactions: an Apostate’s Opinion
2007 Paxos Made Live - An Engineering Perspective
2008
2008 Chukwa: A large-scale monitoring system
2008 Column-Stores vs. Row-Stores: How Different Are They Really?
2008 PNUTS: Yahoo!’s Hosted Data Serving Platform
2008 Top 10 algorithms in data mining
2009
2009 Cassandra - A Decentralized Structured Storage System
2009 Feature Hashing for Large Scale Multitask Learning
2009 HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads
2009 Vertical Paxos and Primary-Backup Replication
2010
2010 A Method of Automated Nonparametric Content Analysis for Social Science
2010 Dapper, a Large-Scale Distributed Systems Tracing Infrastructure
2010 Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers
2010 Dremel: Interactive Analysis of Web-Scale Datasets
2010 Finding a needle in Haystack- Facebook's photo storage
2010 FlumeJava: Easy, Efficient Data-Parallel Pipelines
2010 Large-scale Incremental Processing Using Distributed Transactions and Notifications
2010 Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center
2010 Pregel: A System for Large-Scale Graph Processing
2010 S4: Distributed Stream Computing Platform
2010 Spark: Cluster Computing with Working Sets
2010 The Learning Behind Gmail Priority Inbox
2010 ZooKeeper: Wait-free coordination for Internet-scale systems
2011
2011 Consistency, Availability, and Convergence
2011 CrowdDB: Answering Queries with Crowdsourcing
2011 CrowdDB: Query Processing with the VLDB Crowd
2011 Fast Crash Recovery in RAMCloud
2011 Hogwild!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent
2011 It's Time for Low Latency
2011 Matching Unstructured Product Offers to Structured Product Specifications
2011 Megastore: Providing Scalable, Highly Available Storage for Interactive Services
2011 Resilient Distributed Datasets- A Fault-Tolerant Abstraction for In-Memory Cluster Computing
2011 Scarlett: Coping with Skewed Content Popularity in MapReduce Clusters
2012
2012 A Few Useful Things to Know about Machine Learning
2012 A Sublinear Time Algorithm for PageRank Computations
2012 Avatara: OLAP for Web-scale Analytics Products
2012 Blink and It's Done. Interactive Queries on Very Large Data
2012 BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data
2012 Building high-level features using large scale unsupervised learning
2012 Dimension Independent Similarity Computation
2012 Earlybird: Real-Time Search at Twitter
2012 Fast and Interactive Analytics over Hadoop Data with Spark
2012 HyperDex: A Distributed, Searchable Key-Value Store
2012 ImageNet Classification with Deep Convolutional Neural Networks
2012 Large Scale Distributed Deep Networks
2012 Large-Scale Machine Learning at Twitter
2012 Multi-Scale Matrix Sampling and Sublinear-Time PageRank Computation
2012 Paxos Made Parallel
2012 Paxos Replicated State Machines as the Basis of a High-Performance Data Store
2012 Perspectives on the CAP Theorem
2012 Processing a Trillion Cells per Mouse Click
2012 Shark: Fast Data Analysis Using Coarse-grained Distributed Memory
2012 Spanner: Google's Globally-Distributed Database
2012 Temporal Analytics on Big Data for Web Advertising
2012 The Unified Logging Infrastructure for Data Analytics at Twitter
2012 The Vertica Analytic Database- C-Store 7 Years Later
2013
2013 A Demonstration of SpatialHadoop: An Efficient MapReduce Framework for Spatial Data
2013 A Lightweight and High Performance Monolingual Word Aligner
2013 Answer Extraction as Sequence Tagging with Tree Edit Distance
2013 Automatic Coupling of Answer Extraction and Information Retrieval
2013 CG_Hadoop: Computational Geometry in MapReduce
2013 Consistency-Based Service Level Agreements for Cloud Storage
2013 Dimension Independent Matrix Square using MapReduce
2013 Druid A Real-time Analytical Data Store
2013 Efficient Estimation of Word Representations in Vector Space
2013 Event labeling combining ensemble detectors and background knowledge
2013 Everything You Always Wanted to Know About Synchronization but Were Afraid to Ask
2013 F1: A Distributed SQL Database That Scales
2013 Fast Training of Convolutional Networks through FFTs
2013 GraphX: A Resilient Distributed Graph System on Spark
2013 HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm
2013 MillWheel: Fault-Tolerant Stream Processing at Internet Scale
2013 MLbase: A Distributed Machine-learning System
2013 Naiad: A Timely Dataflow System
2013 Omega: flexible, scalable schedulers for large compute clusters
2013 Online, Asynchronous Schema Change in F1
2013 Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices
2013 Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank
2013 Rich feature hierarchies for accurate object detection and semantic segmentation
2013 Scalable Progressive Analytics on Big Data in the Cloud
2013 Scaling Memcache at Facebook
2013 Scuba: Diving into Data at Facebook
2013 Semi-Markov Phrase-based Monolingual Alignment
2013 Shark: SQL and Rich Analytics at Scale
2013 Some Improvements on Deep Convolutional Neural Network Based Image Classification
2013 Sparrow: Distributed, Low Latency Scheduling
2013 Sparrow: Scalable Scheduling for Sub-Second Parallel Jobs
2013 TAO: Facebook’s Distributed Data Store for the Social Graph
2013 Toward Common Patterns for Distributed, Concurrent, Fault-Tolerant Code
2013 Unicorn: A System for Searching the Social Graph
2013 Warp: Lightweight Multi-Key Transactions for Key-Value Stores
2014
2014 3D Object Manipulation in a Single Photograph using Stock 3D Models
2014 A Partitioning Framework for Aggressive Data Skipping
2014 A Sample-and-Clean Framework for Fast and Accurate Query Processing on Dirty Data
2014 A Self-Configurable Geo-Replicated Cloud Storage System
2014 All File Systems Are Not Created Equal: On the Complexity of Crafting Crash-Consistent Applications
2014 Arrakis: The Operating System is the Control Plane
2014 Automatic Construction of Inference-Supporting Knowledge Bases
2014 Bayesian group latent factor analysis with structured sparse priors
2014 Chinese Open Relation Extraction for Knowledge Acquisition
2014 Coordination Avoidance in Database Systems
2014 DeepFace: Closing the Gap to Human-Level Performance in Face Verification
2014 Diagram Understanding in Geometry Questions
2014 Discourse Complements Lexical Semantics for Non-factoid Answer Reranking
2014 Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?
2014 Eidetic Systems
2014 Execution Primitives for Scalable Joins and Aggregations in Map Reduce
2014 Extracting More Concurrency from Distributed Transactions
2014 f4: Facebook’s Warm BLOB Storage System
2014 Fast Databases with Fast Durability and Recovery Through Multicore Parallelism
2014 Fastpass: A Centralized "Zero-Queue" Datacenter Network
2014 First-person Hyper-lapse Videos
2014 GloVe: Global Vectors for Word Representation
2014 GraphX: Graph Processing in a Distributed Dataflow Framework
2014 Guess Who Rated This Movie: Identifying Users Through Subspace Clustering
2014 In Search of an Understandable Consensus Algorithm
2014 Learning Everything about Anything: Webly-Supervised Visual Concept Learning
2014 Learning to Solve Arithmetic Word Problems with Verb Categorization
2014 Log-structured Memory for DRAM-based Storage
2014 Logical Physical Clocks and Consistent Snapshots in Globally Distributed Databases
2014 MapGraph: A High Level API for Fast Development of High Performance Graph Analytics on GPUs
2014 Mesa: Geo-Replicated, Near Real-Time, Scalable Data Warehousing
2014 Modeling Biological Processes for Reading Comprehension
2014 Orca A Modular Query Optimizer Architecture for Big Data
2014 Pigeon: A Spatial MapReduce Language
2014 Project Adam: Building an Efficient and Scalable Deep Learning Training System
2014 Quantum Deep Learning
2014 R Markdown: Integrating A Reproducible Analysis Tool into Introductory Statistics
2014 Salt: Combining ACID and BASE in a Distributed Database
2014 Scalable Object Detection using Deep Neural Networks
2014 Sequence to Sequence Learning with Neural Networks
2014 Show and Tell: A Neural Image Caption Generator
2014 Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems
2014 The Mystery Machine: End-to-end Performance Analysis of Large-scale Internet Services
2014 The Trill Incremental Analytics Engine
2015
2015 A Neural Algorithm of Artistic Style
2015 Deep Image: Scaling up Image Recognition
2015 Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
2015 Deep Speech: Scaling up end-to-end speech recognition
2015 Fast Convolutional Nets With fbfft: A GPU Performance Evaluation
2015 G-OLA: Generalized On-Line Aggregation for Interactive Analysis on Big Data
2015 Giraffe: Using Deep Reinforcement Learning to Play Chess
2015 Hidden Technical Debt in Machine Learning Systems
2015 Klout Score: Measuring Influence Across Multiple Social Networks
2015 Large-scale cluster management at Google with Borg
2015 Machine Learning Classification over Encrypted Data
2015 Machine Learning Methods for Computer Security
2015 Neural Networks with Few Multiplications
2015 Self-Repairing Disk Arrays
2015 Spark SQL: Relational Data Processing in Spark
2015 SparkNet: Training Deep Networks in Spark
2015 Succinct: Enabling Queries on Compressed Data
2015 Taming the Wild: A Unified Analysis of HOGWILD!-Style Algorithms
2015 The Missing Piece in Complex Analytics: Low Latency, Scalable Model Management and Serving with Velox
2015 Trill: A High-Performance Incremental Query Processor for Diverse Analytics
2015 Twitter Heron: Stream Processing at Scale
2016
2016 Learning to Simplify: Fully Convolutional Networks for Rough Sketch Cleanup
2016 Practical Black-Box Attacks against Deep Learning Systems using Adversarial Examples
2016 Understanding Deep Convolutional Networks

Related projects