Categories
Cassandra Scalability

Creating Multiple Local Cassandra Datacenters with Docker

The Apache Cassandra database has long included support for multiple datacenters. That is, Cassandra allows an organization to configure a cluster to actively store data across disparate datacenters. There are multiple reasons to do this, including improved geographic responsiveness, disaster recovery, and separation of transactional and analytic workloads. In general, when trying to wrap my head around […]

Categories
Kafka

Creating Custom Kafka Partitioners

If you’ve used Kafka, you’ve likely heard about partitions. Kafka allows you to partition the data in a given topic so that the processing work can be divided among multiple nodes. Partitioning the data thus allows more of it to be processed in parallel. Kafka’s default logic will attempt to evenly distribute messages into the […]

Categories
Scalability

What is HTAP?

I’ve begun seeing the database-related acronym HTAP thrown about more and more, so I did a little research to understand its meaning and implications. The acronym HTAP was established by Gartner and, according to Wikipedia, is defined as follows: Hybrid transaction/analytical processing (HTAP) is an emerging application architecture that “breaks the wall” between transaction processing and analytics. It […]

Categories
Cassandra Scalability

An Example Using Cassandra With Zipkin

Zipkin is a system for tracing, viewing, and troubleshooting distributed systems and microservice-based applications. Here’s the description from the Zipkin website: Zipkin is a distributed tracing system. It helps gather timing data needed to troubleshoot latency problems in microservice architectures. It manages both the collection and lookup of this data. Zipkin’s design is based on the […]

Categories
Hadoop Scalability

How Big is Your Elephant?

In order to augment some research we’re doing, we’d like to determine the sizes of Hadoop clusters that organizations are commonly deploying. Why? Well, organizations with 100+ nodes have different needs than those with five, for example.

Categories
Configuration Hadoop Hive Scalability

Give Your MySQL Account Access to Hive

One area where the Apache Hive documentation is not entirely explicit is the database privileges needed for its metastore[1]. Developers often become accustomed to creating a database account that has all privileges granted. But in the Real World, end users of Hive must configure it to point to a metastore RDBMS account […]

Categories
Miscellaneous Scalability

SSDs in the Data Center — Is $/GB/IOPS the Only Relevant Metric?

Wikia’s Artur Bergman recently gave a talk at Velocity about SSD adoption that has generated a lot of buzz. The video can be viewed here. Warning: the video is rated PG-13 for language and adult situations. The focus of his talk was that the relevant metric for data center storage is $/GB/IOPS. He showed how […]

Categories
Miscellaneous Scalability

Pay Off Your Technical Debt

The first thing I do when I get my hands on a client’s code is to figure out the size of the code base. I execute something like this to determine the number of lines of code (LoC) using a fresh checkout from trunk: $ find . -name "*.java" -exec cat {} \; | wc […]

Categories
Hadoop Hive

Using Hive with Existing Files on S3

One feature that Hive gets for free by virtue of being layered atop Hadoop is the S3 file system implementation. The upshot is that all the raw, textual data you have stored in S3 is just a few hoops away from being queried using Hive’s SQL-esque language. Imagine you have an S3 bucket un-originally named […]

Categories
Configuration Hive Java Scalability

How To Try Out Hive on Your Local Machine — And Not Upset Your Ops Team

According to the Hive web site: Hive is a data warehouse infrastructure built on top of Hadoop that provides tools to enable easy data summarization, adhoc querying and analysis of large datasets data stored in Hadoop files. Hive is built on top of various technologies, the most notable being Hadoop and HDFS. As a result, […]