Tuesday, August 29, 2023

RAID 0+1 vs 1+0

Much has been written about RAID 10 vs RAID 01, but most articles either gloss over why one is better than the other or dive into mathematical details comparing the failure rates.  While I love a good math discussion, I felt there was a more intuitive way to explain it.

RAID 0 is fragile - a device failure takes down the whole group.

RAID 1 is robust - a device failure shifts transactions to the remaining device(s).

A device failure will propagate through any layers that are not robust, so you want your robust layer closest to the devices that can fail.  For example, with eight disks, RAID 0+1 (a mirror of two four-disk stripes) loses an entire stripe set when one disk dies, so any second failure on the surviving side is fatal; RAID 1+0 (a stripe of four mirrored pairs) loses only half of one pair, and only that disk's partner is a fatal second failure.

Q.E.D.
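
In Linux md terms, that means building the mirrors first and striping across them.  Here's a rough sketch with four placeholder devices (mdadm can also do this in one step with --level=10):
  mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdb /dev/sdc   # robust layer sits on the disks
  mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sdd /dev/sde   # robust layer sits on the disks
  mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/md1 /dev/md2   # fragile layer sits on top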


Friday, March 26, 2021

My ideal Elastic server

Over and over, the IT crowd at the places I have worked has insisted on virtualizing every server, and over and over, Elasticsearch falls to its knees due to the network latency of accessing SAN storage from the VMs.  I complain about the latency, and I hear that they'll switch me to the newest, fastest SAN technology, and that should "solve all my problems."®

IT likes to virtualize things because they can cram many more virtual servers into high-powered hardware when compared to standing up a physical server for each function.  However, that argument falls apart as the virtual servers become bigger and consume most or all of a hypervisor host's resources.  At the extreme, the hypervisor becomes just an abstraction layer and adds little value beyond the ability to migrate the guest VM.  If anything, the additional layers and complexity seem to increase the occurrences of outages and service degradation.  Plus, increasing the CPU core count usually means dropping the cores' clock speed.

Elasticsearch is designed to run well on commodity hardware with no component-level redundancy.  While it is experiencing no faults, it is able to fully utilize its hardware by spreading the load amongst all the working nodes that have access to shards of the data being searched.  If a failure occurs, it still works, although performance is degraded; first because there is less hardware to spread a search across, and second because the failure likely kicked off some action to restore redundancy and survivability.  This is the core of Elasticsearch's node-level redundancy, in contrast to the component-level redundancy favored by most enterprise IT groups.
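
To make that node-level redundancy concrete, here's a minimal sketch of creating an index whose shards each get one replica on another node (the index name and counts are just examples):
  curl -XPUT 'http://localhost:9200/my-index' -H 'Content-Type: application/json' -d '
  {
    "settings": {
      "number_of_shards": 3,
      "number_of_replicas": 1
    }
  }'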

But let's face it: modern hardware is very reliable, and failures are rare.  Most failures occur because of user error or data center issues such as a power interruption or network faults external to the server.  Or fire or water damage...

Given all these issues, I wish I could convince IT that we're better off with physical servers for our Elastic stack.  Since high availability clusters start at three nodes to avoid split-brain, here's what I envision as the perfect hardware for the task.  Hopefully the design is dense enough that IT wouldn't have anything to complain about.

I want a 1RU rack-mount chassis with hot-swappable slots on the front for three 1/3rd-width physical node trays.  Each node tray would provide:

  • A mainboard with a mid-level AMD EPYC CPU (max 180W?)
  • 8x RDIMM slots (or 12?), enough for 512GB if using 64GB DIMMs
  • All the usual mainboard features, like TPM, southbridge, etc.
  • A modular storage cage that comes in multiple flavors, each one with front-accessible USB/VGA ports and power/reset buttons, 2x internal M.2 SSDs similar to Dell's BOSS, and one of:
    • 1x hot-plug LFF SAS/SATA HDD (M.2s under HD)
    • 2x hot-plug SFF SAS/SATA HDDs/SSDs
    • 2x hot-plug U.2 SFF NVMe SSDs
    • 4x hot-plug E3.S/L EDSFF NVMe SSDs

The backplane would connect the three front trays to the back side of the chassis, which would provide:

  • central lights-out management (e.g. IPMI) for the chassis/nodes
  • 6x hot-plug OCP 3.0 SFF NIC slots, 2x per node
  • 6x hot-plug E3.S/L EDSFF NVMe slots, 2x per node
  • two or three hot-plug (n+1 redundant) PSUs

The storage controller could either live on the mainboard, with SAS/SATA/NVMe cables running to the drives, or on the storage cage itself, connected to a PCIe slot on the mainboard.  Putting it on the cage would make switching between SAS/SATA and NVMe a single-FRU swap, rather than replacing both the cage and a controller on the mainboard.  In any case, hardware RAID is not strictly required; the root device can use software RAID if you really want it, but remember that these are designed for node-level rather than component-level redundancy.  I generally favor a mirror for the OS disks (and redundant PSUs), even with node redundancy.

Regarding compute density, you're looking at a total of three sockets per chassis, with up to 32 cores each at 2.5GHz assuming a maximum TDP of 180W as I write this.  At 96 2.5GHz cores per RU, or 192 threads, that's better than many dual-socket 1RU systems currently out there.  With multiple sockets, you can keep both high core count and high core speed.  You have up to 3 LFF per RU, not far off the usual 4, or up to 6 SFF per RU, a good portion of the usual 8 or 10.  And that does not count the additional internal M.2 drives or the rear EDSFF slots.  With NVMe and EDSFF, there would be 6 E3.S/L per node with 4 up front and 2 in back, and 18 is close to the usual max of 20 per RU.  With thermal limits on the CPU, and no additional room for expansion cards, these would not be heat monsters, so 120VAC PSUs in the 1100W range should suffice.  It may even be possible to modularize the back EDSFF slots and have an optional cage that offers 3x LP PCIe slots instead of 6x EDSFF.
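
A quick dc sanity check on the core math:
  echo '3 32*p' | dc    # 96 cores per chassis
  echo '96 2*p' | dc    # 192 threads with SMT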

To sum it up, a chassis with 1/3rd-width nodes could scale from a single three-server cluster in a single RU up to a huge farm spanning three or more racks at well over 100 nodes per rack.  It's economical in that it doesn't require hypervisor licensing, oodles of redundant disk, or exotic network gear.  It is not even specific to Elasticsearch, so it could be used for other high-density server use cases featuring node-level redundancy; at 3 nodes per RU, it hits what I believe to be a sweet spot between high compute density and still having enough physical space for meaningful per-node local storage.  With the option for an LFF HDD, the current and upcoming crop of 20+ TB drives gives these a place as cold storage servers.

Not all systems require a TB of memory and 128 cores; there is still a place for small and dense physical servers.  Dell, make it so!

Thursday, February 13, 2020

Running Linux VMs in MacOS

Recent versions of macOS come with the Hypervisor framework, which makes virtualization possible.  Not necessarily easy, but possible.  Here are the steps I took to get a raw CentOS 7 VM running on my system:

1. Install Multipass from Canonical.  This is geared towards running Ubuntu images, so if that's all you're after, you're done!  I use the hyperkit executable it includes to launch my CentOS image rather than trying to compile hyperkit or xhyve myself.
  https://multipass.run/
Note: The version of hyperkit that comes with Multipass cannot boot kernels newer than 4.16 at the time of writing.  Once Ubuntu 20.04 comes out, I'm hoping that will change, which will allow us to use it on CentOS8 kernels as well.  Until then, this cannot be used to install CentOS8.

2. Download your Linux install iso.  I downloaded the NetInstall iso from http://mirror.ash.fastserv.com/centos/7.7.1908/isos/x86_64/

3. While you're waiting on the download, create your sparse virtual disk.
I usually create a 16G sparse file for that.  16G is 16777216 kibibytes.
  dd of=centos7.img bs=1024 count=1 seek=16777215 < /dev/zero
To see that this doesn't actually consume 16G of disk, use the -s and -l options for ls, which should show that only 8 512-byte blocks are allocated.  You can also use mdls to see the physical size, which should show that a single 4k block is allocated.
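For example (output will vary, but the allocated size should stay tiny until the VM starts writing):
  ls -sl centos7.img
  mdls -name kMDItemPhysicalSize centos7.img
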
4. Mount the install image.  This is a little complicated.  Because this is a bootable image, the first few sectors with the MBR need to be blanked out before hdiutil recognizes the ISO9660 partitions it contains.  However, we'll need the MBR later when we go to actually install from it.  Fortunately, via APFS, we have a way to create the modified image without taking up twice the disk space.  This uses the clonefile API call via the 'cp -c' option so changes are copy-on-write (COW).
  ln -s ~/Downloads/CentOS-7-x86_64-NetInstall-1908.iso netinst.iso
  cp -c netinst.iso tmp.iso
  dd if=/dev/zero of=tmp.iso bs=2048 count=1 conv=notrunc
  hdiutil attach tmp.iso
Make note of the mount point it displays and the path of its device file for use in the next step.

5. Copy the installer kernel and initrd and clean up.  You may have to poke around in the mounted filesystem for the right location.  For the NetInst install disk, it was in the isolinux directory.
  cp /Volumes/CentOS\ 7\ x86_64/isolinux/{vmlinuz,initrd.img} .
  hdiutil eject /dev/disk2
  rm tmp.iso

6. Launch the installer. 
  sudo "/Library/Application Support/com.canonical.multipass/bin/hyperkit" -A -c 1 -m 2G -s 0,hostbridge -s 1,lpc -l com1,stdio -s 3,virtio-net,eth0 -s 4,ahci-cd,netinst.iso -s 2,virtio-blk,centos7.img -f "kexec,vmlinuz,initrd.img,earlyprintk=serial console=ttyS0"


Within the installer:
- set up the network first, and take note of the DHCP-assigned gateway (e.g. 192.168.64.1)
- use installation source of http://mirror.centos.org/centos/7/os/x86_64/
- I find it easiest to set the install destination as a standard partition, YMMV
- Do not reboot when prompted after the system is installed

7. Drop to a shell before reboot to save your kernel/initrd files.  In Terminal.app, toggle Edit / Use Option as Meta Key, then use Option-Tab to switch between the 5 virtual terminals, one of which is a root shell.  From that root shell in the installer environment, copy the system's installed kernel/initrd to the host:
  scp /mnt/sysimage/boot/vmlinuz* /mnt/sysimage/boot/initramfs* youruser@192.168.64.1:your/path

8. Complete install.  Option-Tab back to the installer prompt and reboot to complete the installation.  Uncheck Use Option as Meta Key in Terminal.app before you forget...

9. Using the vmlinuz-? and initramfs-? files you copied right before completing the install, boot into your new system.
  sudo "/Library/Application Support/com.canonical.multipass/bin/hyperkit" -A -c 1 -m 2G -s 0,hostbridge -s 1,lpc -l com1,stdio -s 3,virtio-net,eth0 -s 2,virtio-blk,centos7.img -f "kexec,vmlinuz-3.10.0-1062.el7.x86_64,initramfs-3.10.0-1062.el7.x86_64.img,earlyprintk=serial console=ttyS0 root=/dev/vda3"

10. Whenever you update the kernel, remember to copy the updated vmlinuz and initramfs files onto the host to use on the next boot.
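From inside the VM, that looks something like this (substitute the newly installed kernel version, and use the same user and path as in step 7):
  scp /boot/vmlinuz-<new-version> /boot/initramfs-<new-version>.img youruser@192.168.64.1:your/path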

Friday, June 7, 2019

Playing with cloned sparse files on APFS

I've been experimenting a bit with APFS and its ability to create and clone sparse files.

Let's get a baseline of how many 4k blocks are in use on my SSD:
bash-3.2$ export BLOCKSIZE=4096
bash-3.2$ df . | awk '{print $3}'
Used
592952
bash-3.2$
Here I create a 40MiB sparse file.  Note that it only takes one 4k block on disk.
bash-3.2$ dd bs=4096 count=1 if=/dev/random of=test seek=10240 2>/dev/null
bash-3.2$ df . | awk '{print $3}'
Used
592953
bash-3.2$ ls -l test
-rw-r--r--  1 pete.nelson  admin  41947136 Jun  7 12:30 test
bash-3.2$ mdls test | tail -2
kMDItemLogicalSize                 = 41947136
kMDItemPhysicalSize                = 4096
bash-3.2$
Now let's write some data within the file at three different offsets.  Note that it only grows by three 4k blocks, the amount of data that was added.  Everything else has never been written to, so it remains the void portion of the sparse file.
bash-3.2$ 2>/dev/null dd bs=4096 count=1 if=/dev/random of=test conv=notrunc seek=1000
bash-3.2$ 2>/dev/null dd bs=4096 count=1 if=/dev/random of=test conv=notrunc seek=2000
bash-3.2$ 2>/dev/null dd bs=4096 count=1 if=/dev/random of=test conv=notrunc seek=3000
bash-3.2$ mdls test | tail -2
kMDItemLogicalSize                 = 41947136
kMDItemPhysicalSize                = 16384
bash-3.2$ df . | awk '{print $3}'
Used
592956
bash-3.2$
Now let's clone the sparse file!  Note that -c is an Apple (actually BSD) feature that calls clonefile() instead of copyfile().  This depends on the underlying filesystem being capable of copy-on-write (COW) operations, which APFS is.  Cloning it takes no additional space on the volume.
bash-3.2$ cp -c test foo
bash-3.2$ df . | awk '{print $3}'
Used
592956
bash-3.2$
However, mdls doesn't differentiate between the two files, and reports that each one takes 16k of physical space on the drive.  While I'm not ready to call this erroneous, it's not entirely accurate.
bash-3.2$ mdls test | tail -2
kMDItemLogicalSize                 = 41947136
kMDItemPhysicalSize                = 16384
bash-3.2$ mdls foo | tail -2
kMDItemLogicalSize                 = 41947136
kMDItemPhysicalSize                = 16384
bash-3.2$
Now let's modify both files and see when blocks are allocated. Writing to offset 3000 on the original file triggers COW on that block, and an additional block is allocated for the original file.  However, writing to that same offset on the clone after that does not take any more space.  That block belongs just to the cloned file, so it doesn't trigger a COW.  Slick!

bash-3.2$ 2>/dev/null dd bs=4096 count=1 if=/dev/random of=test conv=notrunc seek=3000
bash-3.2$ df . | awk '{print $3}'
Used
592957
bash-3.2$ 2>/dev/null dd bs=4096 count=1 if=/dev/random of=foo conv=notrunc seek=3000
bash-3.2$ df . | awk '{print $3}'
Used
592957
bash-3.2$
So far, we've used 5 blocks.  Three are shared between the two files, and the remaining two (the diverged copies of offset 3000) each belong to just one file.

Now let's clone the clone.  As before, no additional space is required.
bash-3.2$ cp -c foo bar
bash-3.2$ df . | awk '{print $3}'
Used
592957
bash-3.2$
Now let's really mix things up and write to four offsets on bar: one (1000) still shared with all three files, one (3000) shared only with foo, and two (4000 and 5000) still sparse.  Each write allocates an additional 4k block, as expected.  When 3000 is written, it triggers a COW since that block is shared with foo.  When 1000 is written, it triggers a COW since that block is shared with all three, and should leave the original block still shared between test and foo.
bash-3.2$ 2>/dev/null dd bs=4096 count=1 if=/dev/random of=bar conv=notrunc seek=3000
bash-3.2$ 2>/dev/null dd bs=4096 count=1 if=/dev/random of=bar conv=notrunc seek=4000
bash-3.2$ 2>/dev/null dd bs=4096 count=1 if=/dev/random of=bar conv=notrunc seek=5000
bash-3.2$ 2>/dev/null dd bs=4096 count=1 if=/dev/random of=bar conv=notrunc seek=1000
bash-3.2$ df . | awk '{print $3}'
Used
592961
bash-3.2$

mdls is still confused.  Between the three files, it shows a total of 14 blocks used, but in reality there are only 9 allocated (Used climbed from 592952 to 592961).

bash-3.2$ mdls test | tail -2
kMDItemLogicalSize                 = 41947136
kMDItemPhysicalSize                = 16384
bash-3.2$ mdls foo | tail -2
kMDItemLogicalSize                 = 41947136
kMDItemPhysicalSize                = 16384
bash-3.2$ mdls bar | tail -2
kMDItemLogicalSize                 = 41947136
kMDItemPhysicalSize                = 24576
bash-3.2$

How much does each file really take?
Test offsets: 1000 is shared with foo, 2000 and 10240 are shared with all three, 3000 is owned.
Foo offsets: 1000 is shared with test, 2000 and 10240 are shared with all three, 3000 is owned.
Bar offsets: 1000 is owned, 2000 and 10240 are shared with all three, 3000, 4000 and 5000 are owned.

bash-3.2$ df . | awk '{print $3}'
Used
592961
bash-3.2$ rm test
bash-3.2$ df . | awk '{print $3}'
Used
592960
bash-3.2$ rm foo
bash-3.2$ df . | awk '{print $3}'
Used
592958
bash-3.2$ rm bar
bash-3.2$ df . | awk '{print $3}'
Used
592952
bash-3.2$
So deleting test freed its owned block at offset 3000.  Deleting foo freed offset 3000, but also offset 1000 which it had shared with test previously.  And deleting bar freed all 6 blocks, since all clones had been deleted and it now owned all blocks.

My question is: short of deleting a cloned file, how can I determine how much space it claims for itself?  Also, how can I determine which other files a clone shares blocks with, and what percentage of each file they have in common?

Is this someone's chance to develop a tool?

Wednesday, March 13, 2019

Serial over network via socat

From http://www.anites.com/2017/11/socat.html:

On the remote host (the one with the physical serial device attached), expose it over TCP:
  socat tcp-listen:8000,reuseaddr,fork \
    file:/dev/ttyUSB0,nonblock,waitlock=/var/run/tty0.lock,b115200,raw,echo=0

On the local host, create a pty linked at /dev/ttyUSB0 that forwards to the remote, then attach to it with tio:
  socat pty,link=/dev/ttyUSB0,waitslave tcp:pi.local:8000
  tio -b 115200 /dev/ttyUSB0

Tuesday, January 22, 2019

Making the most of APFS and xhyve

I'm running macOS Mojave, and using it to host virtual machines via xhyve.  There are some neat tricks one can use to conserve disk space while giving plenty of room to your VMs.

First, a side-thread:  I read about an issue someone had when APFS first came out.  I don't know for sure if it's still an issue, and haven't been able to find the article for this post.  The gist was that if the user fills up the root volume, there's no way to delete any files from recovery mode once APFS has no free extents.  The workaround involves creating a throw-away APFS volume to reserve some free space.  Then, if root ever fills up, boot into recovery, delete the throw-away volume to free up a few extents, then mount root and clean it up as needed.
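
If you want that safety net, creating the throw-away volume should look roughly like this (disk1 is whatever your APFS container is; double-check diskutil apfs addVolume's options, as this is from memory):
diskutil apfs addVolume disk1 APFS DeleteIfRootFull -reserve 500m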

When creating an Ubuntu VM from the install ISO, I use the technique explained at https://gist.github.com/mowings/f7e348262d61eebf7b83754d3e028f6c.  One has to extract the installer's initramfs and kernel image to pass to xhyve.  The often-cited way is to copy the ISO to a temporary file and zero out the first couple of sectors before mounting it to extract the files.  With APFS's COW, there is a way to do that without taking up twice the disk space for the ISO.

I combine these two efforts by creating my throw-away APFS volume (with at least 500MB reserved) and storing the ISO there, since I can always download it again if I accidentally fill up root.  Then, to make a COW copy of the ISO, duplicate it using Finder (or cp -c at a bash prompt, which uses the clonefile() call rather than a read/write of the file contents).  You'll see that duplicating this large file on the small volume does not increase the amount of space used on that volume!  Then just overwrite the first couple of blocks with this command:
dd if=/dev/zero of=/Volumes/DeleteIfRootFull/tmp.iso bs=2048 count=1 conv=notrunc
Now you can keep that temporary ISO around for the next OS install, or even script the mountable ISO creation and boot right from the files on the mounted filesystem.  Note that the conv=notrunc argument is important, as that is what keeps the rest of the file intact while we overwrite the first 2KiB with zeroes.
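
Scripted, the whole dance might look something like this (the ISO name is just an example; the clone must live on the same volume as the original for COW to apply):
cp -c /Volumes/DeleteIfRootFull/ubuntu-18.04-live-server-amd64.iso /Volumes/DeleteIfRootFull/tmp.iso
dd if=/dev/zero of=/Volumes/DeleteIfRootFull/tmp.iso bs=2048 count=1 conv=notrunc
hdiutil attach /Volumes/DeleteIfRootFull/tmp.iso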

You can use a similar trick with dd to create your VM disk as a sparse file.  Even if you have less than 16GB free, you can create a 16GB or larger drive to install your OS into by seeking within the output file to just the final block.  For example, to calculate your seek size for a 16GiB disk, run this command:
echo '16 1048576*1-p' | dc
I got 16777215, which I use in the following command:
dd if=/dev/zero of=hdd1.img bs=1024 count=1 seek=16777215
You'll see the resulting file is exactly 2^34 bytes (16GiB), but if you go to the folder in Finder and view the file's info, you'll see its size as "17,179,869,184 bytes (4 KB on disk)".  Note that the difference between the listed size and the size "on disk" indicates that this is a sparse file (only the extents that have been written to are actually allocated "on disk").  It's 4k instead of the 1k we wrote because APFS allocates space in 4KiB blocks (which also lines up with flash writing a minimum of 4K at once).

Be warned that, as you write to this file, you will be increasing the space used on your APFS volume, and your xhyve VM could try to write up to the file's given size, even if you don't have that much free space available.  So, if you don't monitor your free space, you could end up needing to delete that extra APFS volume sooner than you had expected...

Thursday, August 18, 2016

Sed one-liners

I may occasionally publish small notes on clever commands I learn about.  Putting it here helps me store knowledge that my shoddy personal data management practices might otherwise lose...  One such note is a one-line sed command to print out the Linux interface(s) which handles the default route:
sed -n 's/\(^[^\t]*\)\t00000000.*/\1/p' /proc/net/route
An explanation, from left to right: don't print each line by default (-n); start a substitution ('s/); match a string of non-tab characters at the beginning of the line (^[^\t]*), saving the result (the \( and \) surrounding it); followed by a single tab and a string of eight 0s (\t00000000); followed by anything, to gobble up the rest of the line (.*); then substitute it all with the saved non-tab string (/\1/) and print the result (p').  The eight 0s represent the default route of 0.0.0.0/0.

To be really specific, the eight 0s specify a route for a network of undetermined size starting at 0.0.0.0.  For the true default route, I should also check for a mask of 00000000, as OpenVPN sometimes adds two net routes (0.0.0.0/1 and 128.0.0.0/1) to avoid having to replace the existing default route.  This command will find anything starting at 0.0.0.0 as a default route, which may or may not be what you want...  In reality, since 0.0.0.0/8 is reserved, if a route starts at zero, it's pretty defaulty anyway...
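
A stricter version that also requires the Mask column (field 8 in /proc/net/route) to be all zeroes could look like this; the awk variant is easier on the eyes, and I haven't battle-tested the sed one:
sed -n 's/\(^[^\t]*\)\t00000000\t\([^\t]*\t\)\{5\}00000000\t.*/\1/p' /proc/net/route
awk '$2 == "00000000" && $8 == "00000000" {print $1}' /proc/net/route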
