ceph – setting up rbd-mirror between two ceph clusters

Environment
2x Ceph clusters (all-in-one) running CentOS 7.2 with Ceph Jewel. Added a 2nd CRUSH rule to both clusters:

rule rep_osd {
	ruleset 1
	type replicated
	min_size 1
	max_size 10
	step take default
	step choose firstn 0 type osd
	step emit
}

(ceph crush map)
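
If you need to apply such a rule to a running cluster, the usual way is to decompile, edit and re-inject the CRUSH map – a minimal sketch (file names are arbitrary):

[root@ceph01 ~]# ceph --cluster primary osd getcrushmap -o crushmap.bin
[root@ceph01 ~]# crushtool -d crushmap.bin -o crushmap.txt
# add the rep_osd rule from above to crushmap.txt, then recompile and inject it
[root@ceph01 ~]# crushtool -c crushmap.txt -o crushmap.new
[root@ceph01 ~]# ceph --cluster primary osd setcrushmap -i crushmap.new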

Setup

Install the rbd-mirror package on both sides. Technically the daemon can run on any host, even one that is not part of the cluster.

[root@ceph01 ~]# yum install -y rbd-mirror
[root@ceph04 ~]# yum install -y rbd-mirror
[root@ceph01 ~]# rbd --cluster primary mirror pool info
Mode: disabled
[root@ceph04 ~]# rbd --cluster secondary mirror pool info
Mode: disabled

Check that the cluster name is set. All systemd unit files include that file during startup.

[root@ceph01 ~]# grep -i cluster /etc/sysconfig/ceph 
CLUSTER=primary
[root@ceph04 ~]# grep -i cluster /etc/sysconfig/ceph 
CLUSTER=secondary

Create a key on both clusters which is able to access (rwx) the pool. (ceph authorization (caps))

[root@ceph01 ~]# ceph --cluster primary auth get-or-create client.primary mon 'allow r' osd 'allow class-read object_prefix rbd_children, allow rwx pool=rbd' -o /etc/ceph/primary.client.primary.keyring
[root@ceph04 ~]# ceph --cluster secondary auth get-or-create client.secondary mon 'allow r' osd 'allow class-read object_prefix rbd_children, allow rwx pool=rbd' -o /etc/ceph/secondary.client.secondary.keyring
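
To double-check the caps of the new keys afterwards:

[root@ceph01 ~]# ceph --cluster primary auth get client.primary
[root@ceph04 ~]# ceph --cluster secondary auth get client.secondary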

Enable pool mirroring and verify that it is active.

[root@ceph01 ~]# rbd --cluster primary mirror pool enable rbd pool
[root@ceph01 ~]# rbd --cluster primary mirror pool info
Mode: pool
Peers: none
[root@ceph04 ~]# rbd --cluster secondary mirror pool enable rbd pool
[root@ceph04 ~]# rbd --cluster secondary mirror pool info
Mode: pool
Peers: none

Copy the keys and configs between the clusters. The rbd-mirror daemon on the primary cluster requires the key from the secondary cluster and vice versa.

[root@ceph01 ~]# scp /etc/ceph/primary.client.primary.keyring /etc/ceph/primary.conf root@ceph04:/etc/ceph/
primary.client.primary.keyring
primary.conf
[root@ceph04 ~]# scp /etc/ceph/secondary.client.secondary.keyring /etc/ceph/secondary.conf root@ceph01:/etc/ceph/
secondary.client.secondary.keyring  
secondary.conf

Enable/start the ceph-rbd-mirror service – the unit name is extended with the local cluster name.

[root@ceph01 ceph]# systemctl start ceph-rbd-mirror@primary
[root@ceph04 ceph]# systemctl start ceph-rbd-mirror@secondary
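
To make the daemons start at boot as well (assuming the same unit names):

[root@ceph01 ceph]# systemctl enable ceph-rbd-mirror@primary
[root@ceph04 ceph]# systemctl enable ceph-rbd-mirror@secondary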

Add the remote cluster as a peer. Example: client.secondary represents the key name and @secondary the cluster name. That means rbd-mirror looks for a key like /etc/ceph/secondary.client.secondary.keyring.

[root@ceph01 ceph]# rbd --cluster primary mirror pool peer add rbd client.secondary@secondary 
49c28a78-ef7d-4f12-b003-7ce69f091b85
[root@ceph04 ceph]# rbd --cluster secondary mirror pool peer add rbd client.primary@primary
02053868-7dd7-4029-b287-53a205fdd668
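
Afterwards the Peers line in the pool info should list the remote cluster and client instead of “none”:

[root@ceph01 ceph]# rbd --cluster primary mirror pool info
[root@ceph04 ceph]# rbd --cluster secondary mirror pool info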

That's it! Now create an RBD image and activate the exclusive-lock and journaling features. (man 8 rbd)

[root@ceph01 ceph]# rbd --cluster primary create test-1 --size 5M --image-feature exclusive-lock,journaling
[root@ceph01 ceph]# rbd --cluster primary create test-2 --size 5M --image-feature exclusive-lock,journaling

The test-1 image shall be active (primary) on the primary cluster, test-2 on the secondary cluster – demote/promote accordingly.

[root@ceph04 ceph]# rbd --cluster secondary mirror image demote rbd/test-1
[root@ceph01 ceph]# rbd --cluster primary mirror image promote rbd/test-1

[root@ceph01 ceph]# rbd --cluster primary mirror image demote rbd/test-2
[root@ceph04 ceph]# rbd --cluster secondary mirror image promote rbd/test-2
[root@ceph01 ceph]# rbd --cluster primary mirror pool status --verbose
health: OK
images: 2 total
    1 replaying
    1 stopped

test-1:
  global_id:   ed021ec4-2a44-4b9f-9efa-10590ffcb916
  state:       up+stopped
  description: remote image is non-primary or local image is primary
  last_update: 2016-10-14 14:49:07

test-2:
  global_id:   d99bbff5-14fb-4e07-a596-69e55608f14a
  state:       up+replaying
  description: replaying, master_position=[object_number=3, tag_tid=4, entry_tid=3], mirror_position=[object_number=3, tag_tid=4, entry_tid=3], entries_behind_master=0
  last_update: 2016-10-14 14:49:09

[root@ceph01 ceph]# rbd --cluster primary ls -l
NAME    SIZE PARENT FMT PROT LOCK 
test-1 5120k          2           
test-2 5120k          2      excl 
[root@ceph04 ceph]# rbd --cluster secondary mirror pool status --verbose
health: OK
images: 2 total
    1 replaying
    1 stopped

test-1:
  global_id:   ed021ec4-2a44-4b9f-9efa-10590ffcb916
  state:       up+replaying
  description: replaying, master_position=[object_number=0, tag_tid=3, entry_tid=0], mirror_position=[object_number=0, tag_tid=3, entry_tid=0], entries_behind_master=0
  last_update: 2016-10-14 14:49:21

test-2:
  global_id:   d99bbff5-14fb-4e07-a596-69e55608f14a
  state:       up+stopped
  description: remote image is non-primary or local image is primary
  last_update: 2016-10-14 14:49:21

[root@ceph04 ceph]# rbd --cluster secondary ls -l
NAME    SIZE PARENT FMT PROT LOCK 
test-1 5120k          2      excl 
test-2 5120k          2   

Google Software Updater fuckups

To disable the ksfetch (ks = Keystone) daemon that comes with Google products, there are several ways:

  1. Uninstall the Google Software Update Agent
        $ /Library/Google/GoogleSoftwareUpdate/GoogleSoftwareUpdate.bundle/Contents/Resources/GoogleSoftwareUpdateAgent.app/Contents/Resources/ksinstall [--nuke]
        

    The --nuke parameter will also remove ksfetch-related files.

  2. Set the checkInterval to the maximum (24h = 86400 seconds). The default is 5h (18000 seconds):
        $ defaults read com.google.Keystone.Agent checkInterval
        $ defaults write com.google.Keystone.Agent checkInterval 86400
        

ejabberd + letsencrypt (ssl config)

[...]
listen: 
  - 
    port: 5222
    module: ejabberd_c2s
    certfile: "/etc/ejabberd/ejabberd.pem"
    starttls: true
    starttls_required: true
    protocol_options:
      - "no_sslv2"
      - "no_sslv3"
      - "no_tlsv1"
      - "no_tlsv1_1"
    ciphers: "ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA256"
    dhfile: "/etc/ejabberd/dh2048.pem"
    [...]
  - 
    port: 5269
    ip: "::"
    module: ejabberd_s2s_in
    protocol_options:
      - "no_sslv2"
      - "no_sslv3"
      - "no_tlsv1"
      - "no_tlsv1_1"

[...]
s2s_use_starttls: required
s2s_certfile: "/etc/ejabberd/ejabberd.pem"
s2s_dhfile: "/etc/ejabberd/dh2048.pem"
s2s_ciphers: "ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA256"

s2s_protocol_options:
  - "no_sslv2"
  - "no_sslv3"
  - "no_tlsv1"
  - "no_tlsv1_1"

Link: https://docs.ejabberd.im/admin/guide/configuration/

RHEV/ovirt – can’t switch SPM role – async_tasks are stuck

On the host with the SPM role

$ vdsClient -s 0 getAllTasksStatuses
{'status': {'message': 'OK', 'code': 0}, 'allTasksStatus': {'feb3aaa5-ec1c-42a6-8f17-f7c94891b43f': {'message': '1 jobs completed successfully', 'code': 0, 'taskID': '631fd441-0955-49da-9376-1cba24764aa7', 'taskResult': 'success', 'taskState': 'finished'}, 'b4fe0c6d-d458-4ed2-a9e2-2c0d41914b8f': {'message': '1 jobs completed successfully', 'code': 0, 'taskID': '67e1a2e8-3747-43fa-b0dd-fc469a6f6a02', 'taskResult': 'success',
'taskState': 'finished'}}}

On the RHEV/ovirt manager

$ for i in b4fe0c6d-d458-4ed2-a9e2-2c0d41914b8f feb3aaa5-ec1c-42a6-8f17-f7c94891b43f; do psql --dbname=engine --command="DELETE FROM async_tasks WHERE vdsm_task_id='${i}'"; done
$ for j in b4fe0c6d-d458-4ed2-a9e2-2c0d41914b8f feb3aaa5-ec1c-42a6-8f17-f7c94891b43f; do vdsClient -s 0 clearTask ${j}; done
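
Afterwards, re-check on the SPM host that the tasks are gone:

$ vdsClient -s 0 getAllTasksStatuses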

ROSE Xeon CW (2015) & power2max Rotor 3D+

Frame: ROSE Xeon CW 2015

Bottom bracket: Rotor Pressfit 4630 (PF46-68-30)

Crankset: Rotor 3D+ with p2m spider

Spacers per specs: 1x A + 1x E on the drive side (DS), 1x A on the non-drive side (NDS)

Spacers installed: 2x A on the drive side

With the spacers specified by Rotor, the spider rubs against the frame. According to ROSE and a sports lab it is no problem to move the 2.5 mm spacer from the non-drive side to the drive side.

RHEV/ovirt – find stuck / zombie tasks

Random notes

$ vdsClient -s 0 getAllTasksStatuses
$ vdsClient stopTask <taskid>
$ vdsClient clearTask <taskid>
$ su - postgres
$ psql -d engine -U postgres
> select * from job order by start_time desc;
> select DeleteJob('702e9f6a-e2a3-4113-bd7d-3757ba6bc4ef');

or

/usr/share/ovirt-engine/dbscripts/engine-psql.sh -c "select * from job;"

entropy inside a virtual machine

Sometimes my Ceph (test!) deployments inside a VM failed.

The problem is that the kernel/CPU cannot provide enough entropy (random numbers) for the ceph-create-keys command, so it gets stuck/hangs. It is not a Ceph problem! The same can happen with SSL commands.

But first things first – we need to check the available entropy on a system:

cat /proc/sys/kernel/random/entropy_avail

The read-only file entropy_avail gives the available entropy.
Normally, this will be 4096 (bits), a full entropy pool (see man 4 random)

Values below 100–200 mean you have a problem!
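
To keep an eye on the pool while testing, e.g.:

watch -n1 cat /proc/sys/kernel/random/entropy_avail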

For a virtual machine we can add a new device – virtio-rng. Here is an XML example for libvirt.

<rng model='virtio'>
  <backend model='random'>/dev/random</backend>
</rng>
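
With virsh the device can also be added to a running guest, and inside the VM you can verify that the virtio backend shows up (the domain name is a placeholder):

# save the XML snippet above as rng.xml, then:
virsh attach-device <domain> rng.xml --live --config
# inside the guest:
cat /sys/class/misc/hw_random/rng_available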

That is ok for ONE virtual machine on the hypervisor, but usually there is more than one. Therefore we need to install the rng-tools package inside the virtual machines, so that rngd feeds the guest's kernel entropy pool from the virtio-rng device.

$pkgmgr install rng-tools   # $pkgmgr = yum/zypper/apt, depending on the distribution
systemctl enable rngd
systemctl start rngd

That’s it! That solved a lot of my problems 😉

Mac OS – bashrc / homebrew – random notes

bash-completion

if [ -f "$(brew --prefix)/etc/bash_completion" ]; then
    source "$(brew --prefix)/etc/bash_completion"
fi

generic colouriser

Example – source grc's aliases for coloured command output:

if [ -f "$(brew --prefix)/etc/grc.bashrc" ]; then
    source "$(brew --prefix)/etc/grc.bashrc"
fi

spotlight for the command line

spotlight () { mdfind "kMDItemDisplayName == '$@'wc"; }
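
The wc suffix on the query string is a pair of Spotlight comparison modifiers – w matches on word boundaries, c makes the match case-insensitive. Usage (quote multi-word queries):

spotlight 'tax report'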

Openstack Horizon – leapyear bug

Switching the language in the dashboard ends with an error, because on February 29 Horizon computes the cookie expiry as the same day one year ahead – a date that does not exist:

day is out of range for month

eg. https://bugs.launchpad.net/horizon/+bug/1551099

[Mon Feb 29 09:20:05 2016] [error] Internal Server Error: /settings/
[Mon Feb 29 09:20:05 2016] [error] Traceback (most recent call last):
[Mon Feb 29 09:20:05 2016] [error]   File "/usr/lib64/python2.6/site-packages/django/core/handlers/base.py", line 112, in get_response
[Mon Feb 29 09:20:05 2016] [error]     response = wrapped_callback(request, *callback_args, **callback_kwargs)
[Mon Feb 29 09:20:05 2016] [error]   File "/usr/lib64/python2.6/site-packages/horizon/decorators.py", line 36, in dec
[Mon Feb 29 09:20:05 2016] [error]     return view_func(request, *args, **kwargs)
[Mon Feb 29 09:20:05 2016] [error]   File "/usr/lib64/python2.6/site-packages/horizon/decorators.py", line 52, in dec
[Mon Feb 29 09:20:05 2016] [error]     return view_func(request, *args, **kwargs)
[Mon Feb 29 09:20:05 2016] [error]   File "/usr/lib64/python2.6/site-packages/horizon/decorators.py", line 36, in dec
[Mon Feb 29 09:20:05 2016] [error]     return view_func(request, *args, **kwargs)
[Mon Feb 29 09:20:05 2016] [error]   File "/usr/lib64/python2.6/site-packages/django/views/generic/base.py", line 69, in view
[Mon Feb 29 09:20:05 2016] [error]     return self.dispatch(request, *args, **kwargs)
[Mon Feb 29 09:20:05 2016] [error]   File "/usr/lib64/python2.6/site-packages/django/views/generic/base.py", line 87, in dispatch
[Mon Feb 29 09:20:05 2016] [error]     return handler(request, *args, **kwargs)
[Mon Feb 29 09:20:05 2016] [error]   File "/usr/lib64/python2.6/site-packages/django/views/generic/edit.py", line 171, in post
[Mon Feb 29 09:20:05 2016] [error]     return self.form_valid(form)
[Mon Feb 29 09:20:05 2016] [error]   File "/srv/www/openstack-dashboard/openstack_dashboard/wsgi/../../openstack_dashboard/dashboards/settings/user/views.py", line 38, in form_valid
[Mon Feb 29 09:20:05 2016] [error]     return form.handle(self.request, form.cleaned_data)
[Mon Feb 29 09:20:05 2016] [error]   File "/srv/www/openstack-dashboard/openstack_dashboard/wsgi/../../openstack_dashboard/dashboards/settings/user/forms.py", line 89, in handle
[Mon Feb 29 09:20:05 2016] [error]     expires=_one_year())
[Mon Feb 29 09:20:05 2016] [error]   File "/srv/www/openstack-dashboard/openstack_dashboard/wsgi/../../openstack_dashboard/dashboards/settings/user/forms.py", line 32, in _one_year
[Mon Feb 29 09:20:05 2016] [error]     now.minute, now.second, now.microsecond, now.tzinfo)
[Mon Feb 29 09:20:05 2016] [error] ValueError: day is out of range for month

SUSE Openstack Cloud – sleshammer – pre/post scripts – pxe trigger

Enable root login for the sleshammer image

(it is used by SUSE Cloud as the hardware discovery image)

The sleshammer image mounts “/updates” via NFS from the admin node and executes control.sh. This script checks whether there are pre/post hooks and executes them.

root@admin:/updates # cat /updates/discovered-pre/set-root-passwd.hook
#!/bin/bash
echo "root" | passwd --stdin root

echo
echo
echo "ROOT LOGIN IS NOW ENABLED!"
echo
echo
sleep 10

Make sure that the hook is set executable!
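
For example:

root@admin:/updates # chmod +x /updates/discovered-pre/set-root-passwd.hook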

SUSE Openstack Cloud supports only pre and post scripts. “discovered” is the state – “discovery” or “hardware-installed” should also work.

BTW: You can also create a custom control.sh script (and also hooks) for a single node!

mkdir /updates/d52-54-00-9e-a6-90.cloud.default.net/
cp /updates/control.sh /updates/d52-54-00-9e-a6-90.cloud.default.net/

Some random notes – discovery/install

default pxelinux configuration
(see http://admin-node:8091/discovery/pxelinux.cfg/)

DEFAULT discovery
PROMPT 0
TIMEOUT 10
LABEL discovery
  KERNEL vmlinuz0
  append initrd=initrd0.img crowbar.install.key=machine-install:34e4b23a970dbb05df9c91e0c1cf4b512ecaa7b839c942b95d86db1962178ead69774a9dc8630b13da171bcca0ea204c07575997822b3ec1de984da97fca5b84 crowbar.hostname=d52-54-00-8b-c2-17.cloud.default.net crowbar.state=discovery
  IPAPPEND 2

allocated node

The sleshammer image waits for this entry (.*_install) on the admin node once you allocate a node.

DEFAULT suse-11.3_install
PROMPT 0
TIMEOUT 10
LABEL suse-11.3_install
  KERNEL ../suse-11.3/install/boot/x86_64/loader/linux
  append initrd=../suse-11.3/install/boot/x86_64/loader/initrd   crowbar.install.key=machine-install:34e4b23a970dbb05df9c91e0c1cf4b512ecaa7b839c942b95d86db1962178ead69774a9dc8630b13da171bcca0ea204c07575997822b3ec1de984da97fca5b84 install=http://192.168.124.10:8091/suse-11.3/install autoyast=http://192.168.124.10:8091/nodes/d52-54-00-8b-c2-17.cloud.default.net/autoyast.xml ifcfg=dhcp4 netwait=60
  IPAPPEND 2