
Ceph in the city: introducing my local Kubernetes to my ‘big’ Ceph cluster

Ceph has long been a favourite technology of mine. It's a storage mechanism that just scales out forever. Gone are the days of RAID and complex sizing and setup. Chuck all your disks into whatever number of servers, and let Ceph take care of it. Want more read speed? Give it more read replicas. Want a filesystem that is consistent across many hosts? Use CephFS. Want your OpenStack Nova/Glance/Cinder to play nice, work well, and have tons of space? Use Ceph.

TL;DR: if you want to save a lot of money in your organisation, use Ceph.

Why do you want these things? Cost and scalability. Ceph can dramatically lower your organisation's storage cost versus running a big NAS or SAN, and do it with higher performance and better onward scalability. Don’t believe me? Check YouTube.

My Ceph system at home is wicked fast, but not that big: 3 × 1TB NVMe. We talked about this earlier, and you may recall the beast-of-the-basement and its long NVMe challenges. It's been faithfully serving my OpenStack system for a while, so why not the Kubernetes one?

NVMe is not expensive anymore. I bought 3 of these at $200 each for 1TB. But, and this is really trick-mode, they have built-in capacitor 'hard power down', so you don't need a battery-backed RAID. If your server shuts down dirty, the blocks in the drive's cache still flush to flash, meaning you can run without hard sync. Performance is much higher.
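A quick way to see what the capacitors buy you is to benchmark small synchronous writes, which is roughly what a Ceph journal/WAL does. A rough fio sketch (the device path is an example, and it writes straight to the device, so only point it at a scratch disk):

# Measure 4k O_SYNC random-write performance on an NVMe device.
# WARNING: destructive to anything on the target; /dev/nvme1n1 is an example name.
sudo fio --name=plp-test --filename=/dev/nvme1n1 \
  --ioengine=libaio --direct=1 --sync=1 \
  --rw=randwrite --bs=4k --iodepth=1 \
  --runtime=30 --time_based

A drive with capacitor-backed power-loss protection can acknowledge these writes from its cache, so it will post far higher numbers than a consumer drive that has to hit flash on every sync.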

OK, first a digression. Kubernetes has this concept of a 'provisioner', sort of like Cinder. There are 3 main ways I could have gone:

  1. We use Magnum on OpenStack; it creates Kubernetes clusters, which in turn have access to Ceph automatically.
  2. We use OpenStack Cinder as the PVC provisioner for Kubernetes.
  3. We use the Ceph rbd-provisioner in Kubernetes.

I tried #1 and it worked OK. I have not tried #2. This post is about #3. Want to see? Let's dig in. Pull your parachute now if you don’t want to be blinded by YAML.

First, the rbd-provisioner itself has to be running, along with secrets holding the Ceph keys it will use.
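A minimal sketch of that setup, assuming a dedicated 'kube' pool, the out-of-tree ceph.com/rbd (external-storage) provisioner image, and placeholder secret names; adjust for your cluster:

# Grab keys from the Ceph cluster and create a pool for Kubernetes volumes
ADMIN_KEY=$(sudo ceph auth get-key client.admin)
sudo ceph osd pool create kube 64
USER_KEY=$(sudo ceph auth get-or-create-key client.kube mon 'allow r' osd 'allow rwx pool=kube')

# Stash the keys as Kubernetes secrets for the provisioner and kubelets to use
kubectl create secret generic ceph-admin-secret --type=kubernetes.io/rbd \
  --from-literal=key="${ADMIN_KEY}" --namespace=kube-system
kubectl create secret generic ceph-user-secret --type=kubernetes.io/rbd \
  --from-literal=key="${USER_KEY}" --namespace=default

# Deploy the out-of-tree rbd-provisioner (ServiceAccount/RBAC omitted for brevity)
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rbd-provisioner
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: rbd-provisioner
  template:
    metadata:
      labels:
        app: rbd-provisioner
    spec:
      serviceAccountName: rbd-provisioner
      containers:
      - name: rbd-provisioner
        image: quay.io/external_storage/rbd-provisioner:latest
        env:
        - name: PROVISIONER_NAME
          value: ceph.com/rbd
EOF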



Now we need to create the StorageClass. We need the **NAME** of one or more of the mons (you don't need all of them); replace MONHOST1 with your **NAME**. Note: if you don't have a name for your mon host and want to use an IP, you can create an ExternalName Service with xip.io:

kind: Service
apiVersion: v1
metadata:
  name: monhost1
  namespace: default
spec:
  type: ExternalName
  externalName: 1.2.3.4.xip.io
You would then use monhost1.default.svc.cluster.local as the mon name in the StorageClass below.
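The StorageClass points the ceph.com/rbd provisioner at the mon, the pool, and the secrets created above. A sketch, reusing the placeholder names from earlier:

cat <<EOF | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rbd
provisioner: ceph.com/rbd
parameters:
  monitors: monhost1.default.svc.cluster.local:6789
  pool: kube
  adminId: admin
  adminSecretName: ceph-admin-secret
  adminSecretNamespace: kube-system
  userId: kube
  userSecretName: ceph-user-secret
  imageFormat: "2"
  imageFeatures: layering
EOF

To prove it works, create a PVC against the new class and check that it goes Bound:

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: rbd-test
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: rbd
  resources:
    requests:
      storage: 1Gi
EOF
kubectl get pvc rbd-test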

2 thoughts on “Ceph in the city: introducing my local Kubernetes to my ‘big’ Ceph cluster”

  1. Adriano da Silva

    Hello.
    I manage a data center in a small business. I use seven old Proxmox and Ceph hyperconverged servers to store and run Linux and Windows virtual machines, which are accessed remotely via Terminal Services.

    As I use HDDs in all servers, and some are consumer HDDs, write performance is sometimes not enough.

    Here the budget is small, and using enterprise HDDs is not always an option. SSDs are also difficult, as they are expensive, especially the ones with tantalum capacitors.

    So, I found the Hynix NVMe in the photo for sale at a low price and bought some units to try to configure as a cache, to reduce and stabilize access times.

    I thought of, in addition to putting the DB/WAL on it, also running a small cache on a separate partition using bcache.

    But I haven’t found documentation with technical details about this NVMe. Do you know where I could find information about its erase block size and whether it is suitable for intensive writes?

    Do you think the strategy I put in can help?

    Thank you!

    1. so…
      The NVMe I had pictured is quite slow. It's nice since the capacitors on it mean it's suitable for use in e.g. a WAL (so on a sudden power down it will not corrupt data).

      You will definitely want the logs (DB/WAL) on NVMe rather than HDD; it makes things much, much faster.
      The one I used (pictured) is a Hynix PE3110 (HFS960GD0MEE-5410A). It's quite old now (launched in 2016?).
      https://www.digchip.com/datasheets/parts/datasheet/2/202/HFS960GD0TEG-6410A.php has some of the specifications for the next generation after it.
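      For the DB/WAL-on-NVMe part, the usual approach is to hand ceph-volume a partition (or LV) on the NVMe when each OSD is created. A rough sketch with placeholder device names:

      # One OSD per HDD, with its BlueStore DB (the WAL lives there too unless
      # split out separately) on a partition of the NVMe. Device names are examples.
      sudo ceph-volume lvm create --bluestore \
        --data /dev/sdb \
        --block.db /dev/nvme0n1p1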

      don@x9:~$ sudo nvme list
      Node          SN                 Model                       Namespace  Usage                  Format       FW Rev
      ------------  -----------------  --------------------------  ---------  ---------------------  -----------  --------
      /dev/nvme0n1  S318NA0HB01128     SAMSUNG MZVKW1T0HMLH-00000  1          156.79 GB /   1.02 TB  512 B + 0 B  CXA7300Q
      /dev/nvme1n1  EI65Q082110206713  HFS960GD0MEE-5410A          1          960.20 GB / 960.20 GB  512 B + 0 B  40032A00
      /dev/nvme2n1  EI65Q08211020670V  HFS960GD0MEE-5410A          1          960.20 GB / 960.20 GB  512 B + 0 B  40032A00
      /dev/nvme3n1  EI65Q08211020671D  HFS960GD0MEE-5410A          1          960.20 GB / 960.20 GB  512 B + 0 B  40032A00

      It's a high-endurance device, meant for servers, but I don't see the datasheet.

      If you want the output of some nvme-cli command on mine, in case you were thinking of the same (rather than a similar) model, let me know.
