ceph - down pgs after lost osd
Posted on Wed 19 August 2020 in Ceph • 1 min read
Initial situation / Issue
- pool size = 1 (Don't ask why)
- lost/crashed osd
- pgs down/incomplete
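To find the affected PGs and the down OSD in the first place, the usual status commands are enough (output will of course differ per cluster):

$ ceph health detail
$ ceph pg dump_stuck inactive
$ ceph pg ls incomplete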
Verify
$ ceph pg <id> query
{
    "snap_trimq": "[]",
    "snap_trimq_len": 0,
    "state": "active+clean",
    "epoch": 348,
    "acting": [
        87
    ],
    [...]
    "blocked": "peering is blocked due to down osds",
    "down_osds_we_would_probe": [
        60
    ],
    "peering_blocked_by": [
        {
            "osd": 60,
            "current_lost_at": 0,
            "comment": "starting or marking this osd lost may let us proceed"
        }
    ]
    [...]
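Before changing anything, it is worth confirming that osd.60 is really down and cannot simply be restarted, for example:

$ ceph osd tree | grep down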
Resolution
Mark the missing OSD as lost (the command expects the numeric OSD ID, here 60):
$ ceph osd lost <id> --yes-i-really-mean-it
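Afterwards the PG query should no longer list osd.60 under peering_blocked_by; a quick check:

$ ceph pg <id> query | grep -A 3 peering_blocked_by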
You can try to force-create the PG:
$ ceph pg force-create-pg <id>
But because the PG already exists, this usually fails or has no effect.
In the meantime, Ceph has already tried to start the PG on a new OSD, but without success: the PG stays incomplete.
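Which OSD the PG is currently mapped to can be checked via the PG map (in this example it points at osd.87):

$ ceph pg map <id>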
Therefore the faulty PG must be marked as complete on the new OSD, i.e. accepted as an empty PG (data loss!):
# stop the OSD that now holds the incomplete PG
systemctl stop ceph-osd@87
# check that the PG exists on this OSD
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-87 \
    --pgid <id> --op info
# mark the (empty) PG as complete on this OSD
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-87 \
    --pgid <id> --op mark-complete
systemctl start ceph-osd@87
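Once the OSD is back up, the PG should peer and become active again; verify with:

$ ceph -s
$ ceph pg <id> query | grep '"state"'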
Root Cause
The pool size is 1, so the data on the lost OSD existed in only one copy and there is no replica to recover the PG from.
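To prevent a repeat, check and raise the replication settings of the pool (replace <pool> with the actual pool name):

$ ceph osd pool get <pool> size
$ ceph osd pool set <pool> size 3
$ ceph osd pool set <pool> min_size 2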