Tag Archives: Docker

Understanding Docker Images and Layers

08 Friday Sep 2017

Posted by adeeshfulay in Docker, Storage


Tags

cloud, Docker, oracle, stateful, weblogic

Not so long ago, I presented a couple of sessions at the IOUG Collaborate 17 conference. During my session 'Docker 101 for Oracle DBAs', there were a bunch of questions about Docker images and the concept of layers and their benefits. This blog summarizes the discussion I had with the audience during that session.

What are Docker Images and how do they compare to VM Images?

For anyone who has used VMs, the concept of a VM image is not new. A Docker image is very similar and serves the same purpose, that is, to create containers, but that is where the similarity ends. While a VM image is a single large file, a Docker image references a list of read-only layers that represent differences in the filesystem. These layers are stacked one over the other, as shown in the image below, and form the basis of the container root filesystem. The Docker storage driver stacks and maintains the different layers, and also manages sharing of layers across images. This makes building, pulling, pushing, and copying of images fast and saves on storage.

[Image: read-only image layers stacked to form the container root filesystem]

How are Images used to create Containers?

When you spawn a container (docker run <image-name>), it gets its own thin, writable container layer, and all changes are stored in that layer. This means that multiple containers can share access to the same underlying image and yet have their own data state.
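To see this separation, here is a minimal sketch; the image name is hypothetical, so substitute any image you have locally. The -s flag of docker ps reports the size of each container's writable layer separately from the virtual size of the underlying image:

>> docker run -d --name wls1 my/weblogic
>> docker run -d --name wls2 my/weblogic
>> docker ps -s

Both containers report the same virtual size (the shared read-only layers), while the individual SIZE of each stays tiny, because the writable layer holds only that container's changes.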

When a container is deleted, all data stored in its container layer is lost. For databases and data-centric apps, which require persistent storage, Docker allows mounting the host's filesystem directly into the container. This ensures that the data persists even after the container is deleted, and the data can be shared across multiple containers. Docker also allows mounting data volumes from external storage arrays and storage services like AWS EBS via its Docker Volume Plug-ins.
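As a quick sketch (the image name and paths here are hypothetical), a bind mount and a named volume would look like this:

# Mount a host directory directly into the container for the data files
>> docker run -d --name db1 -v /data/oradata:/u01/app/oracle/oradata my/database

# Or create a named volume, which a volume plug-in can back with external storage
>> docker volume create dbdata
>> docker run -d --name db2 -v dbdata:/u01/app/oracle/oradata my/database

Deleting db1 or db2 leaves the data in /data/oradata or in the dbdata volume intact, and a new container can mount it and pick up where the old one left off.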

How do I find the layers of an Image?

Older versions of Docker provided `docker images --tree`, which would show a tree view of all images and layers. Unfortunately, in its absence, we have to look for other options.
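One such option, assuming Docker 1.10 or later (where image layers are content-addressable), is `docker inspect`, which prints the digest of every layer in an image, though without any parent-image lineage:

>> docker inspect --format '{{json .RootFS.Layers}}' <image-name>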

For illustration, I am going to use an Oracle WebLogic server image I built. This image was built using multiple other images, each with its own set of layers.

[Image: `docker images` output for the WebLogic image]
You can see the total size of the image, currently 1.62GB, and the image ID.

Next, we will use the `docker history <image>` command to see the layers of this image. The output below shows all the layers that make up the WebLogic image, but what are all the commands in the second-to-last column? For these, we have to refer to the dockerfile. Every instruction in a dockerfile creates a new layer, and since we used multiple dockerfiles to create the different images, this view shows the aggregate of all instructions run to build the final image we are using. Here is a good reference on how to write good dockerfiles.

[Image: `docker history` output for the WebLogic image]
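To make the instruction-to-layer mapping concrete, here is a minimal, hypothetical dockerfile sketch; the base image and paths are made up for illustration. Each instruction produces one layer, and metadata-only instructions produce empty layers:

# Base image layers are pulled and reused as-is
FROM oraclelinux:7-slim
# COPY creates a new layer containing only the copied files
COPY app/ /u01/app/
# RUN creates a new layer holding whatever the command changed on disk
RUN chmod +x /u01/app/start.sh
# CMD records metadata only; its layer adds zero bytes
CMD ["/u01/app/start.sh"]

Building this with `docker build -t demo .` and then running `docker history demo` would show one row per instruction, stacked on top of the rows inherited from the base image.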

While this information is useful, it still doesn't give us the hierarchical or tree view of images and layers. For this, we will use a hack: run the following command, which downloads an image from Docker Hub and prints out the tree view we want.

>> docker run --rm -v /var/run/docker.sock:/var/run/docker.sock nate/dockviz images -t

[Image: dockviz tree view of all images and layers]

This view shows the same set of layers as the history command, but it also shows the lineage. If I focus on just the hierarchy of images and layers used to build my Oracle WebLogic image, I see that I used 4 different dockerfiles to build 4 different images. In fact, the last two images were derived from the same set of shared layers up to layer 'bd54831efb16 Virtual Size: 1.2 GB', after which the tree splits. Below is the hierarchy of images I built and used.

[Image: hierarchy of the 4 images used to build the WebLogic image]

Now that I know the hierarchy of images being used, I can run the `docker history` command for each image and see its individual layers and the dockerfile instructions used to create them.

[Image: `docker history` output for each image in the hierarchy]
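For example, assuming four hypothetical image names standing in for my hierarchy, a quick loop prints each image's layers in turn:

>> for img in oraclelinux:7.1 my/serverjre:8 my/fmw-infra:12.2.1 my/weblogic:12.2.1; do docker history $img; done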

Once I had mapped all layers to their respective images, I ended up with the following breakdown.

[Image: final breakdown of layers per image]

Success!!

If you want to learn more about Docker and other container formats, take a look at my blog – Containers Deep Dive – LXC vs Docker


Shameless Plug: 

I currently work for Robin Systems, and we provide an excellent container-based platform for both stateless and persistent stateful applications, especially Big Data applications and relational and NoSQL databases. The platform includes the four key components required to run stateful applications: containers (Docker and LXC), scale-out block storage, virtual networking, and orchestration. It can be deployed on premises on commodity hardware or on public clouds like AWS.

Try our free community edition!

 


Can Containers Ease Cassandra Management Challenges?

02 Friday Sep 2016

Posted by adeeshfulay in Docker


Tags

cassandra, Docker, nosql database

This is a repost of my original blog on the Robin Systems website.

Container-based virtualization and microservice architecture have taken the world by storm. Applications with a microservice architecture consist of a set of narrowly focused, independently deployable services, each of which is expected to fail. The advantage: increased agility and resilience. Agility, since individual services can be updated and redeployed in isolation. Resilience, since the distributed nature of microservices means they can be deployed across different platforms and infrastructures, and developers are forced to think about failure from the ground up instead of as an afterthought. These are the defining principles for large web-scale and distributed applications, and web companies like Netflix, Twitter, Amazon, and Google have benefited significantly from this paradigm.

Add containers to the mix. Containers are fast to deploy, allow bundling of all the dependencies an application requires (breaking out of dependency hell), and are portable, which means you can truly write your application once and deploy it anywhere. Together, microservice architecture and containers make applications faster to build, easier to maintain, and of overall higher quality.

[Image: microservices vs monolith, borrowed from Martin Fowler's excellent blog: http://martinfowler.com/articles/microservices.html]

A major change forced by microservice architecture is the decentralization of data. This means that, unlike monolithic applications, which prefer a single logical database for persistent data, microservices let each service manage its own database: either different instances of the same database technology or entirely different database systems.

Unfortunately, databases are complex beasts: they have a strong dependence on storage, need customized solutions for HA, DR, and scaling, and, if not tuned correctly, will directly impact application performance. Consequently, the container ecosystem has largely ignored the heart of most applications, storage, and this limits the benefits of container-based microservices because stateful and data-heavy services like databases cannot be containerized.

The majority of container ecosystem vendors have focused on stateless applications. Why? Stateless applications are easy to deploy and manage: for example, they can respond to events by adding or removing instances of a service without significant changes or reconfiguration of the application. For stateful applications, most container ecosystem vendors have focused on orchestration, which only solves the problem of deployment and scale, while existing storage vendors have tried to retrofit their current solutions for containers via volume plug-ins to orchestration solutions. Unfortunately, this is not sufficient.

To dive deeper into this, let’s take the example of Cassandra, a modern NoSQL database, and look at the scope of management challenges that need to be addressed.

Cassandra Management Challenges

While poor schema design and query performance remain the most prevalent problems, they are rather application and use-case specific, and require an experienced database administrator to resolve. In fact, I would say most Cassandra admins, or any DBA for that matter, enjoy this task and pride themselves on being good at it.

The management tasks that database admins would rather avoid, or have automated, are:

  1. Low utilization and lack of consolidation
  2. Complex cluster lifecycle management
  3. Manual & cumbersome data management
  4. Costly scaling

 

Let’s look at these one by one.

1. Low utilization and lack of consolidation

Cassandra clusters are typically created per use case or SLA (read intensive, write intensive). In fact, the common practice is to give each team its own cluster. This would be an acceptable practice if clusters weren't deployed on dedicated physical servers. To avoid performance and noisy-neighbor issues, most enterprises stay away from virtual machines. This unfortunately means that the underlying hardware has to be sized for peak workloads, leaving large amounts of spare capacity and idle hardware due to varying load profiles.

All this leads to poor utilization of infrastructure and very low consolidation ratios. This is a big issue for enterprises both on premises and in the cloud.

Underutilized servers == Wasted money.

2. Complex Cluster Lifecycle Management

Given the need for physical infrastructure (compute, network, and storage), provisioning Cassandra clusters on premises can be time consuming and tedious. The hardest part of this activity is estimating the read and write performance the designed configuration will deliver, so it often involves extensive schema design and performance testing on individual nodes.

Besides initial deployment, enterprises also have to cater to node failures. Failures are the norm and have to be planned for from the get-go. Node failures can be temporary or permanent and can be caused by various things: hardware faults, power failures, natural disasters, operator errors, etc. While Cassandra is designed to withstand node failures, a failure still has to be resolved by adding a replacement node, and it puts additional load on the remaining nodes for data rebalancing, once after the failure and again after the new node is added.

[Image: node failure and replacement in a Cassandra cluster]

1. Node A fails

2. Other nodes take on the load of node A

3. Node A is replaced with A1

4. Other nodes are loaded again as they stream data to node A1

 

3. Manual Data Management

Unlike traditional databases like Oracle, Cassandra does not come with utilities that automatically back up the database. Cassandra offers backups in the form of snapshots and incremental copies, but they are quite limited in features; a minimal example follows the list below. The most notable limitations of snapshots are:

  • Use hard links to store point-in-time copies
  • Use the same disk as the data files (compaction makes this worse)
  • Are per node
  • Do not include a schema backup
  • Offer no support for off-site relocation
  • Have to be removed manually
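To make these limitations concrete, here is a minimal sketch of the built-in tooling; the keyspace and tag names are hypothetical, and exact options vary by Cassandra version:

# Hard-link the current SSTables under each table's snapshots/<tag> directory,
# on this node only; repeat on every node in the cluster
>> nodetool snapshot -t nightly_2016_09_02 mykeyspace

# Incremental copies must be enabled separately (incremental_backups: true
# in cassandra.yaml, or at runtime)
>> nodetool enablebackup

# Snapshots are never cleaned up automatically
>> nodetool clearsnapshot -t nightly_2016_09_02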

Similarly, data recovery is fairly involved. Data recovery is performed for one of two reasons:

  1. To recover the database from failures, for example, data corruption or data loss due to an incorrect `truncate table`
  2. To create a copy of the database for other uses, for example, a clone for dev/test environments or for testing schema changes

Typical steps to recover a node from data failures

[Image: typical steps to recover a Cassandra node from data failures]
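As a rough sketch of those steps, assuming hypothetical paths, keyspace, and table names (details vary by version and packaging):

# 1. Stop the node and clear stale commitlogs
>> sudo service cassandra stop
>> rm -rf /var/lib/cassandra/commitlog/*

# 2. Copy the snapshot's SSTables back into the table's data directory
>> cp /var/lib/cassandra/data/mykeyspace/mytable-*/snapshots/mytag/* \
      /var/lib/cassandra/data/mykeyspace/mytable-*/

# 3. Restart, then tell Cassandra to load the restored SSTables
>> sudo service cassandra start
>> nodetool refresh mykeyspace mytable

And this is just one node; recovering a whole cluster multiplies these steps across every replica.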

To optimize the space used for backups, most enterprises retain the last 2 or 3 backups on the server and move the rest to a remote location. This means that, depending on the data sample needed, you may be able to restore locally on the server or may have to move files around from a remote source.

While the DataStax Enterprise edition does provide a notion of scheduled backups via OpsCenter, it still involves careful planning and execution.

4. Costly Scaling

With Cassandra's ability to scale linearly, most administrators are quite accustomed to adding nodes (scaling out) to expand the size of a cluster. With each node you gain additional processing power and data capacity. But while node addition caters to a steady increase in database usage, how does one handle transient spikes?

Let's look at a scenario. Typically, once a year, most retail enterprises go through a planning frenzy for Thanksgiving. Unfortunately, after that event, the majority of the infrastructure either sits idle or has to be broken up and repurposed for other uses. Wouldn't it be interesting if there were a way to simply add or remove resources dynamically and scale the cluster up or down based on transient load demands?

Summary

Many enterprises have experimented with Docker containers and open-source orchestrators like Mesos and Kubernetes, but they soon discover that these tools, along with the basic storage support in their volume plug-ins, only solve the problem of deployment and scale; they are unable to address challenges with container failover, data and performance management, and transient workloads. In comes Robin Systems.

Robin is a container-based, application-centric server and storage virtualization platform that turns commodity hardware into a high-performance, elastic, and agile application/database consolidation platform. In particular, Robin is ideally suited for data applications such as databases and Big Data clusters, as it provides most of the benefits of hypervisor-based virtualization but with bare-metal performance (up to 40% better than VMs) and application-level IO resource management capabilities such as minimum IOPS guarantees and maximum IOPS caps. Robin also dramatically simplifies data lifecycle management with features such as 1-click database snapshots, clones, and time travel.

Join us for a webinar on Thursday, August 25, at 10am Pacific / 1pm Eastern to learn how Robin's converged, software-based infrastructure platform can make Cassandra management a breeze by addressing all of the aforementioned challenges.

 
