Root Cause Analysis Tool

Requirements:

  • Basic understanding of distributed systems
  • Familiarity with a programming language
  • Some familiarity with log analytics

Description:

Distributed systems are often hard to troubleshoot. With many logs and a vast array of potential issues arising from hardware, the network, and configuration, finding the root of a problem can be time consuming. This summer project will collect all of the various information sources in a single location and provide a framework for diagnosing problems.

An applicant interested in this project should have a working knowledge of distributed systems, some experience with Linux troubleshooting and log analysis, and knowledge of a high-level programming language for the implementation (C++ or Python preferred).
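
As a rough illustration of the collection step, the Python sketch below gathers suspicious lines from the Ceph daemon logs into one place. It assumes the default log location /var/log/ceph, and the severity pattern is a deliberate simplification; a real tool would also pull in syslog, dmesg output, and network diagnostics.

    import glob
    import re

    # Simplified pattern for log lines that usually indicate trouble.
    PROBLEM = re.compile(r'\b(ERR|error|fail|timeout|slow request)\b',
                         re.IGNORECASE)

    def collect_problems(log_glob='/var/log/ceph/*.log'):
        """Gather suspicious lines from all Ceph daemon logs in one place."""
        findings = []
        for path in glob.glob(log_glob):
            with open(path, errors='replace') as f:
                for lineno, line in enumerate(f, 1):
                    if PROBLEM.search(line):
                        findings.append((path, lineno, line.rstrip()))
        return findings

    if __name__ == '__main__':
        for path, lineno, line in collect_problems():
            print('%s:%d: %s' % (path, lineno, line))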

Milestones:

GSOC 2017 community bonding:

  • understand the architecture of Ceph
  • examine how logging in Ceph works (both system-wide and locally on the mon/osd processes)

GSOC 2017 midterm:

  • rough prototype of a program that will collect data
  • working outline of an approach for classifying root causes when diagnosing problems

GSOC 2017 final:

  • Final program with the ability to sort and filter data for fast RCA
  • Stretch goal: working GUI

ceph-mgr: Smarter Reweight-by-Utilization

Requirements:

  • The applicant should have done coursework in probability and mathematical statistics
  • Basic understanding of distributed systems
  • Familiarity with a programming language

Description:

Ceph uses the CRUSH algorithm to distribute data objects among storage devices. The devices are weighted, so CRUSH can maintain a statistically balanced utilization of storage and bandwidth resources. At the very beginning, the administrator is likely to specify a weight relative to the total capacity of each device, but over time the utilization of the storage devices can become unbalanced. This hurts the availability of the storage pool: once any storage device assigned to a pool is full, the whole pool stops being writeable, even if there is abundant space on other devices. We also want to minimize the performance impact caused by rebalancing. A smarter reweight algorithm would therefore be very helpful.

It is hard to evaluate the performance of such an algorithm, so the participant should first build a model or a tool to evaluate performance, and then come up with a reweight algorithm that addresses the problems listed above.
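
As one possible starting point for the model, here is a deliberately naive sketch of the reweight idea (not Ceph's actual implementation): nudge each device's weight toward the mean utilization while capping the per-step change to bound rebalancing traffic.

    def reweight_step(weights, utilizations, max_change=0.05):
        """One naive reweight iteration.

        weights       -- dict of osd_id -> CRUSH weight
        utilizations  -- dict of osd_id -> used fraction (0.0..1.0)
        max_change    -- cap on relative weight change per step,
                         to bound rebalancing traffic
        """
        mean = sum(utilizations.values()) / len(utilizations)
        new_weights = {}
        for osd, w in weights.items():
            util = utilizations[osd]
            # Overfull devices get less weight, underfull ones more.
            factor = mean / util if util > 0 else 1.0 + max_change
            factor = max(1.0 - max_change, min(1.0 + max_change, factor))
            new_weights[osd] = w * factor
        return new_weights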

Milestones:

GSOC 2017 community bonding:

  • understand the CRUSH algorithm and the architecture of Ceph
  • be able to use the tools in Ceph to test the CRUSH algorithms

GSOC 2017 midterm:

  • build up a model or a tool to evaluate the performance of reweight process/algorithm

GSOC 2017 final:

  • design an algorithm which outperforms the existing reweight-by-pg and reweight-by-utilization algorithms

ceph-mgr: Cluster Status Dashboard

Requirements:

  • Familiarity with Python programming
  • Basic understanding of distributed systems
  • Good understanding of front-end technologies and web standards, like HTML5, CSS3 and JavaScript
  • Basic visual design skills

Description:

Ceph-mgr is a daemon that collects real-time statistics from a Ceph cluster. It embeds a Python interpreter and exposes a set of Python APIs that hosted Python modules can consume. This project involves designing and prototyping a status dashboard that visualizes cluster statistics at different levels and from different perspectives.

This project can be divided into three phases:

  1. working with mentors to understand the architecture of Ceph and the meaning of the various metrics.
  2. understanding how ceph-mgr exposes the cluster information to the hosted python modules (see the sketch after this list).
  3. designing and implementing part of the dashboard based on the existing framework.
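
To make phase 2 concrete, a hosted ceph-mgr module might look roughly like this minimal sketch. It is a simplification: the MgrModule base class and the get() call exist in the Ceph tree, but the exact data formats should be checked against the current source.

    # Minimal sketch of a hosted ceph-mgr module (simplified).
    from mgr_module import MgrModule

    class Module(MgrModule):
        def serve(self):
            # 'df' is one of the cluster structures ceph-mgr exposes
            # to hosted modules; a dashboard would render fields such
            # as total capacity and bytes used from it.
            df = self.get('df')
            self.log.info('cluster usage: %s' % df)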

Milestones:

GSOC 2017 community bonding:

  • set up a ceph cluster with ceph-mgr for testing
  • understand the ceph development workflow

GSOC 2017 midterm:

  • work with mentors and the Ceph community to come up with a detailed design for improving this web app
  • the design will include information design, navigation design, and UI design; the participant can use wireframes to mock up the new UI elements
  • implement a prototype of the new design

GSOC 2017 final:

  • prototype of the new dashboard with the added UI elements

Additional Mentors:

  • John Spray

ceph-mgr: Slow OSD Identification, Automated Cluster Response

Requirements

  • Basic understanding of distributed systems
  • Familiarity with Python programming language

Description:

Ceph-mgr is a daemon that collects real-time statistics from a Ceph cluster. It embeds a Python interpreter and exposes a set of Python APIs that hosted Python modules can consume. An OSD is the daemon that manages a storage device. It would be very helpful to identify slow OSDs and take measures before the performance degradation has a visible impact on users.
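
As an illustration of what "identify the slow OSDs" could mean, the sketch below flags outliers among per-OSD latency samples using the median and the median absolute deviation. How those latencies are obtained through the ceph-mgr APIs, and what policy then applies, is the actual project work.

    def slow_osds(latencies_ms, k=3.0):
        """Flag OSDs whose latency is an outlier.

        latencies_ms -- dict of osd_id -> recent commit latency in ms
        k            -- how many MADs above the median counts as slow
        """
        values = sorted(latencies_ms.values())
        median = values[len(values) // 2]
        mad = sorted(abs(v - median) for v in values)[len(values) // 2]
        threshold = median + k * max(mad, 1.0)  # avoid a zero MAD
        return [osd for osd, v in latencies_ms.items() if v > threshold]

    # Example: OSD 3 is much slower than its peers -> prints [3].
    print(slow_osds({0: 12.0, 1: 15.0, 2: 11.0, 3: 220.0}))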

Milestones:

GSOC 2017 community bonding:

  • set up a ceph cluster with ceph-mgr for testing
  • understand the ceph development workflow

GSOC 2017 midterm:

  • understand the various metrics that profile a given OSD
  • make sure the APIs for accessing the necessary metrics are exposed to python modules in ceph-mgr

GSOC 2017 final:

  • come up with a python module to identify the slow OSDs
  • the python module will also follow predefined policies to help administrators address the problem.

ceph-mgr: Pool pg_num Auto-Tuning

Requirements:

  • Familiarity with Python programming
  • Basic understanding of distributed systems

Description:

Ceph-mgr is a daemon that collects real-time statistics from a Ceph cluster. It embeds a Python interpreter and exposes a set of Python APIs that hosted Python modules can consume. In Ceph, data is distributed among storage devices in the form of “objects”, which are in turn aggregated by placement groups within a storage pool. Tracking object placement and object metadata on a per-object basis is computationally expensive: a pool with millions of objects cannot realistically track placement per object. Instead, the Ceph client hashes the object ID to calculate which placement group an object should be placed in, so every object ID maps to a certain placement group. Better data durability and a more even distribution call for more placement groups, but because each placement group is served by a set of OSDs, the more placement groups an OSD serves, the more CPU and memory it demands; their number should therefore be kept to the minimum that meets those goals. In short, it is a tradeoff, and as more objects are stored in a given pool, the number of placement groups should be tuned. For more information, see http://docs.ceph.com/docs/master/rados/operations/placement-groups/.
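
The rule of thumb from the Ceph documentation, which targets on the order of 100 placement groups per OSD and rounds up to a power of two, can serve as a baseline for such a module; a sketch:

    def suggest_pg_num(num_osds, pool_size, target_pgs_per_osd=100):
        """Baseline pg_num suggestion (rule of thumb from the Ceph docs).

        num_osds   -- number of OSDs that will serve the pool
        pool_size  -- replica count (or k+m for erasure coding)
        """
        raw = num_osds * target_pgs_per_osd / float(pool_size)
        # Round up to the next power of two, as the docs recommend.
        pg_num = 1
        while pg_num < raw:
            pg_num *= 2
        return pg_num

    print(suggest_pg_num(num_osds=40, pool_size=3))  # -> 2048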

Milestones:

GSOC 2017 community bonding:

  • set up a ceph cluster with ceph-mgr for testing
  • Understand how Ceph uses placement groups for data placement and migration
  • Understand the tradeoffs

GSOC 2017 midterm:

  • build a simple python module to calculate the placement group number from given settings
  • embed this python module in ceph-mgr

GSOC 2017 final:

  • integrate this python module into the dashboard of the ceph-mgr

ceph-mgr: Commands for CephFS Client Auth Caps Creation

Requirements

  • Basic understanding of distributed systems
  • Familiarity with Python programming language

Description:

Ceph-mgr is a daemon that collects real-time statistics from a Ceph cluster. It embeds a Python interpreter and exposes a set of Python APIs that hosted Python modules can consume. Caps (short for “capabilities”) are how Ceph describes the authorization of an authenticated user to exercise the functionality of different components of Ceph. To grant a CephFS client access to a certain directory, one needs to use the ceph command line to create the corresponding caps for it. Being able to leverage ceph-mgr to do this job would greatly simplify the process.
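
Today this is done with the ceph CLI, for example via ceph auth get-or-create. The sketch below wraps that command from Python; the pool name and cap strings are illustrative, and a real ceph-mgr module would use the mgr APIs rather than shelling out.

    import subprocess

    def create_cephfs_client(client_id, path, data_pool='cephfs_data'):
        """Create a CephFS client restricted to one directory.

        Wraps the 'ceph auth get-or-create' CLI; the pool name and
        cap strings here are examples, not a fixed recipe.
        """
        return subprocess.check_output([
            'ceph', 'auth', 'get-or-create', 'client.%s' % client_id,
            'mon', 'allow r',
            'mds', 'allow rw path=%s' % path,
            'osd', 'allow rw pool=%s' % data_pool,
        ])

    # Example: grant client.alice read/write access under /home/alice.
    # print(create_cephfs_client('alice', '/home/alice'))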

Milestones:

GSOC 2017 community bonding:

  • set up a ceph cluster with cephfs for testing
  • understand the ceph development workflow

GSOC 2017 midterm:

  • setup cephfs and be able to create a new client with required auth caps
  • understand the ceph-mgr framework

GSOC 2017 final:

  • come up with a python module to automate the creation of the cephfs client auth caps

RGW: Local File Backend for RGW

Requirements:

  • The applicant should have done coursework in Operating Systems and Databases.
  • Be very comfortable using C++ and a Linux development environment, and working with a large code base.
  • Interest in application architecture and design

Description:

This project involves designing and prototyping a simplified database backend for the Ceph Object Gateway (RGW). This backend will make it possible for developers to set up an RGW without the rest of Ceph. The abstraction for the backend would be useful for adding other backends in the future.

The first half of the project involves working with mentors to understand RGW internals and architecture, writing up a multi-page design document, and setting up a standalone prototype in C++ that implements some object operations on a database. The design document will show how a file backend would fit into the rest of the RGW architecture.

The second half of the project involves implementing a prototype of the design using a database as the backend. Participants should have made substantial progress or completed a prototype by the end of coding.

Milestones:

GSoC 2017 Community Bonding:

  • Comfortable setting up and working in a Ceph development environment

GSoC 2017 Midterm:

  • Set up a standalone prototype that implements some operations on a database
  • Worked with mentors and Ceph community to come up with a detailed design overview for File Backend

GSoC 2017 Final:

  • Prototype of the design

Additional Mentors:

  • Matt Benjamin
  • Casey Bodley

RGW: Multi-Language S3 Testing for Rados Gateway

Requirements:

  • Python programming experience and interest in learning new languages
  • Comfortable using a Linux development environment

Description:

The Ceph testing framework is almost entirely written in python, but many of the major SDKs used to test the Amazon S3 storage protocol and OpenStack Swift storage are in other languages. Bugs have shown up that only occur with these non-python SDKs. This project involves implementing existing tests in these other SDKs, adding new tests to them, and making them ready for use in Ceph’s upstream testing framework: teuthology. One of the main challenges of this project will be working within a multi-host, multi-OS, multi-language systems environment.
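
For reference, the Python side of such a test typically looks like the boto3 sketch below; the goal of the project is to express the same core operations in each of the other SDKs. The endpoint and credentials here are placeholders for a test radosgw instance.

    import boto3

    # Placeholders: point these at a test radosgw instance.
    s3 = boto3.client(
        's3',
        endpoint_url='http://localhost:8000',
        aws_access_key_id='test-access-key',
        aws_secret_access_key='test-secret-key',
    )

    # A core test: round-trip one object through the gateway.
    s3.create_bucket(Bucket='test-bucket')
    s3.put_object(Bucket='test-bucket', Key='hello', Body=b'hello world')
    body = s3.get_object(Bucket='test-bucket', Key='hello')['Body'].read()
    assert body == b'hello world'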

Milestones:
GSOC 2017 community bonding:

  • test cluster running ceph / radosgw
  • a separate test virtual machine with the various language SDKs installed
  • understand the ceph developer workflow

GSOC 2017 midterm:

  • determine a core subset of tests across all the SDKs
  • make sure the core set is implemented in one new language
  • have at least one test written for each of the other SDKs in different languages
  • document the runtime requirements for the various SDKs to be used in Ceph’s upstream test framework

GSOC 2017 final:

  • implement some of the core set in the other SDKs
  • integrate the SDKs into teuthology, with successful test suite runs

Additional Mentors:

  • Matt Benjamin
  • Casey Bodley
  • Marcus Watts