The Ceph Blog

Updates to Ceph tgt (iSCSI) support

In a previous blog post I introduced work we’ve done to the user-space tgt iSCSI project to allow exporting RADOS block device (rbd) images as iSCSI targets. I’ve recently taken a short break from working on the Calamari project to update that support, removing some limitations and adding some functionality.

The tgt-admin utility now works with the rbd backend bs_rbd. tgt-admin is used to set up tgtd from a target-configuration file, and is typically used at boot time, so this makes it handier to have persistent targets mapped on a host.
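
For example, a targets.conf entry for an rbd-backed target might look roughly like the sketch below; the IQN and image name here are placeholders, and the exact set of directives accepted depends on your tgt version:

<target iqn.2013-11.com.example:rbd.target0>
    # use the userland rbd backend (bs_rbd) instead of the default rdwr backend
    bs-type rbd
    # name of the rbd image to export
    backing-store iscsi-image
</target>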

There is no longer a 20-rbd-image-per-tgtd limit.

tgtadm accepts a new --bsopts parameter for each mapped image to set bs_rbd options:

  • conf=<path-to-ceph.conf> allows you to refer to a different Ceph cluster for each image (each image has its own cluster connection)
  • id=<client-id> allows each image to use a different Ceph client id, which allows per-client configuration for each image (including things like permissions, log settings, rbd cache settings, etc.) The full client name will be “client.<client-id>” in normal Ceph fashion. (The default id is “admin”, as usual, for a default client name of “client.admin”.)

So, for example, you might use

tgtadm --lld iscsi --mode logicalunit --op new --tid 1 --lun 1 --bstype rbd --backing-store public-image --bsopts "conf=/etc/ceph/pubcluster.conf;id=public"

to establish a target in the “pubcluster” for an image named “public-image” whose configuration is expressed in sections named “client.public”. (The double quotes are required to hide the ‘;’ bsopts separator from the shell.)
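
For illustration, the pubcluster’s configuration file (/etc/ceph/pubcluster.conf) might then contain a section like this; the keyring path, log file, and cache setting are just examples of per-client settings, not required values:

[client.public]
    # settings here apply only to the client named "client.public"
    keyring = /etc/ceph/pubcluster.client.public.keyring
    log file = /var/log/ceph/client.public.log
    rbd cache = true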

You can pick up packages built with the Ceph rbd support from the Debian and RPM repositories at http://ceph.com/packages/ceph-extras.

Comments

  1. I needed to make this change to tgt-admin to support setting bs-type inside targets.conf. This approach should make it easier to support more bs-types in tgt-admin. Perhaps you can submit it upstream?

    --- /usr/sbin/tgt-admin.orig 2013-11-05 03:22:04.000000000 +0000
    +++ /usr/sbin/tgt-admin 2013-11-18 18:53:27.081414567 +0000
    @@ -484,10 +484,24 @@
         # Is the device in use?
         my $can_alloc = 1;
    -    if ($bstype !~ "rbd" && $force != 1 && $$target_options_ref{'allow-in-use'} ne "yes") {
    +    if ($force != 1 && $$target_options_ref{'allow-in-use'} ne "yes") {
             $can_alloc = check_device($backing_store,$data_key_ref);
         }

    -    if ($can_alloc == 1 &&
    -        ($bstype =~ "rbd" || (-e $backing_store && ! -d $backing_store))) {
    +    # Load up lun-specific bstype
    +    if (ref $value eq "HASH") {
    +        if ($$value{$backing_store}{'bs-type'}) {
    +            $bstype = $$value{$backing_store}{'bs-type'};
    +        }
    +    }
    +    # Does the device exist?
    +    my $device_exists = 0;
    +    if ($bstype =~ "rbd") {
    +        system "rbd info $backing_store >/dev/null";
    +        if ($? == 0) { $device_exists = 1; }
    +    } else {
    +        $device_exists = 1 if (-e $backing_store && ! -d $backing_store);
    +    }
    +
    +    if ($can_alloc == 1 && $device_exists == 1) {
             my @exec_commands;
             my $device_type;
    Posted by Walter Huf
    November 18, 2013 at 6:57 pm
    • Thanks Walter. I haven’t had a chance to look at this yet, but I had tested with an rbd and a plain rdwr image in the same targets.conf, and I thought they worked. I could be mistaken, though, so I want to try that test again and study your change; I’ve just been backed up with other stuff.

      Posted by dmick
      November 26, 2013 at 6:11 am
  2. In order to extend Ceph’s reliability to iSCSI, can this be set up to use iSCSI multipath with two iSCSI servers? I’ve read elsewhere that Ceph does not support having two connections to the same RBD volume, which would seem to suggest this is not possible. Would it work using active/passive (i.e., both paths to the RBD are not active at the same time)?

    Posted by Tom
    November 19, 2013 at 7:14 pm
    • Hi Thomas. I haven’t looked in detail at iSCSI multipath, but it seems that when most people talk about it, they’re talking about multiple NICs/cables/switch paths to a single target device. That’s certainly one area of redundancy. As far as I know, tgt/Ceph don’t affect setting up such redundant access; it’s just like any other iSCSI target as far as IP multipathing goes.

      The other sort of redundancy (multiple hosts sharing access to a single target) is more challenging, and you’re right, Ceph has no real access moderation at the RBD level; the way I’d expect this to work is for the iSCSI target to support persistent reservations. I know work has been done on tgt to support such things (with reservation state stored in a separate filesystem, for example), but I don’t know its current state. Of course, whatever clustering software is trying to share the device would need to issue the SCSI reservation commands to arbitrate access, and this implies that the iSCSI target host is still a SPOF unless the clustering software can also manage multiple targets.

      So the answer is “no worries for IP redundancy” and “basic access control possibly supported by tgt today for a larger clustering solution using upper-layer software.”

      Posted by dmick
      November 19, 2013 at 7:25 pm
  3. Great work dmick. I hacked in some similar changes but not as extensively.

    Are there any plans to officially port or implement rbd on Windows? We would be keen to implement that, but I see someone is already working on it. The trouble is that there are no documents detailing the Ceph protocols. Will these be released at any point, or is the source code the only reference?

    Posted by Matt
    November 25, 2013 at 11:26 pm
    • Your best bet is to bring that up on the ceph-devel list. I know people are looking at making various pieces of Ceph run on Windows, but I don’t know what their status or plans are.

      Posted by dmick
      November 26, 2013 at 6:09 am
  4. I guess the ultimate for me is to be able to boot a Windows machine off an RBD image without the limitations of iSCSI.

    Posted by Matt
    November 25, 2013 at 11:27 pm
  5. @Matt

    I’m running Windows machines with qemu-rbd on Xen.
    They boot directly from an RBD image, although via QEMU.

    Posted by Bram
    November 26, 2013 at 11:22 am
  6. How do I use the rbd cache with the “id” setting? Any examples?
    Thanks

    Posted by bcat
    December 17, 2013 at 5:54 pm
    • Using ‘id’ sets your client name, which is “client.<id>”; once you do that, you can put sections in ceph.conf that contain settings specifically for that client. See http://ceph.com/docs/master/rados/configuration/ceph-conf/ for general configuration information; just like you can specify [osd.1], you can specify [client.my_tgtd_clientname] (if you want the settings specific to the tgtd client).

      Posted by dmick
      December 17, 2013 at 9:40 pm
  7. Great project! Ceph really needed such a tool for compatibility with legacy systems limited to iSCSI-only support.

    Unfortunately, redundancy is the issue. When using Ceph RBD natively, we have several MONs to connect to, and even if one MON fails, the system (like KVM) will transparently reconnect to another one by use of round-robin DNS.

    I thought of using this iSCSI / RBD approach to boot my bare-metal servers with iPXE from RBD volumes, but I am afraid of having a SPOF on the iSCSI target server.

    Can anyone suggest an alternative way of PXE booting servers from RBD volumes? There should be a broad spectrum of usage patterns for this, especially in cloud setups.

    Posted by Dmit
    February 5, 2014 at 11:05 am
  8. Could you clarify for me: is the end result here any different from mapping an RBD to the server and then setting up tgt to use that RBD as the backend for an iSCSI target? Is this basically just a shortcut to that setup process, or are there functionality or performance advantages to doing it this way?

    Not that convenience doesn’t have value, I’m just curious if I’m missing something. Thanks! :)

    Posted by Joe
    February 24, 2014 at 7:39 pm
    • “mapping an RBD on the server” involves using the kernel RBD block device; this module operates exclusively in userland (it contacts the cluster through the userland librados). There are several differences there; one is that the userland code is typically more up-to-date in terms of features (since it doesn’t have to settle into the kernel release, which is a big deal especially for some of the distros that value stability over features). Another is potential performance: context-switching overhead is obviously lower, and there’s a dedicated cache in librbd that could work better for certain workloads. To my knowledge, no one’s done any benchmarking.

      Posted by dmick
      February 24, 2014 at 8:07 pm
  9. @Dmit – I’m with you… I’d like to understand how to PXE boot a group of diskless XenServer hosts straight from a Ceph cluster using RBD and bypassing iSCSI. Then, once the Xen host servers are up (basically running in RAM), we can boot guests straight from Ceph all day long.

    We get all the advantages of the high availability and performance without having to “re-implement” a clustered HA iSCSI target etc. iPXE would provide a free starting point… but it is beyond me how to do it.

    Is this even a possibility?

    Posted by Stephen Perkins
    May 12, 2014 at 10:02 pm
  10. @Dmit/Stephen – It’s actually quite easy to PXE boot Linux from RBD or CephFS.

    I managed to do it by just booting the kernel with a customised initramfs created using initramfs-tools under Ubuntu Server. The process is similar to the NFS approach, but I created an RBD script based on the nfs one which calls rbd map and mount instead of nfs mount, and I also created hook scripts to copy the rbd module into the kernel’s initial ramdisk. After that it switches to the real root and away it goes.

    I think iSCSI should be avoided in general, but it is the best option for booting bare-metal Windows servers while no native kernel driver client exists in Windows.

    For redundancy you might be able to use a couple of iSCSI targets in HA.

    Posted by Matt
    May 30, 2014 at 2:06 am
  11. When I use an rbd image via iSCSI, I can dd from and to it as a raw device without any issues and with good performance.
    The moment I create a filesystem on it, the iSCSI device goes to 100% busy with no throughput.
    And these messages appear:

    Jun 29 23:12:37 localhost kernel: [514938.441408] sd 10:0:0:3: [sdi] Unhandled error code
    Jun 29 23:12:37 localhost kernel: [514938.442025] sd 10:0:0:3: [sdi]
    Jun 29 23:12:37 localhost kernel: [514938.442591] Result: hostbyte=DID_TRANSPORT_DISRUPTED driverbyte=DRIVER_OK
    Jun 29 23:12:37 localhost kernel: [514938.443180] sd 10:0:0:3: [sdi] CDB:
    Jun 29 23:12:37 localhost kernel: [514938.443763] Write(10): 2a 00 00 00 00 ff 00 00 01 00
    Jun 29 23:12:37 localhost kernel: [514938.445289] blk_update_request: 20 callbacks suppressed
    Jun 29 23:12:37 localhost kernel: [514938.446549] sd 10:0:0:3: [sdi] Unhandled error code
    Jun 29 23:12:37 localhost kernel: [514938.458706] sd 10:0:0:3: [sdi]
    Jun 29 23:12:37 localhost kernel: [514938.459268] Result: hostbyte=DID_TRANSPORT_DISRUPTED driverbyte=DRIVER_OK
    Jun 29 23:12:37 localhost kernel: [514938.459855] sd 10:0:0:3: [sdi] CDB:
    Jun 29 23:12:37 localhost kernel: [514938.460418] Write(10): 2a 00 00 00 00 00 00 00 ff 00

    Any suggestions?

    I’ve got similar issues when the iSCSI target is connected to a Windows environment.

    Posted by Bram
    June 29, 2014 at 9:20 pm


© 2013, Inktank Storage, Inc. All rights reserved.