Category Archives: Linux

gordon: the simple chef workflow

Coding Devops Linux Open Source

Nikolas Lahtinen, one of our top full-stack devops developers, wrote on his blog (http://nikolas.ninja/gordon-simple-chef-workflow/):

I was working on provisioning some Docker containers with Chef.

After finding the knife commands verbose to the point of extreme annoyance, and knife unable to manage the simple task of unpacking its own groceries, I decided there was need for an actual Chef, so I called Gordon Ramsay over!

No more annoying fumbling with the knife as you try to remember whether it was cookbook site download or site download cookbook or whatever. No more combing through metadata.rb and metadata.json files for dependencies. Need to create a new repository? “Where was that example repo again? I swear I had the URL somewhere in my emails…” Don’t worry, Gordon has got you covered!
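As an illustration of what "combing through metadata.rb for dependencies" involves (this is not how gordon does it), here is a minimal sketch that lists the cookbook names declared with depends in a Chef metadata.rb. The function name list_cookbook_deps is hypothetical, and it only handles the common one-line depends 'name' form:

```shell
#!/bin/bash
# list_cookbook_deps [FILE]  (hypothetical helper)
# Print the cookbook names declared with "depends" in a Chef metadata.rb
# (default: ./metadata.rb), one per line. Handles lines such as
#   depends 'apt'
#   depends "mysql", "~> 5.0"
# Commented-out lines (starting with #) are ignored.
list_cookbook_deps() {
    local file=${1:-metadata.rb}
    sed -n "s/^[[:space:]]*depends[[:space:]]*[\"']\([^\"']*\)[\"'].*/\1/p" "$file"
}
```

For example, running list_cookbook_deps inside a cookbook directory prints one dependency name per line.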

Gordon – the apt-get of Chef

To be clear, gordon is a scaffolding app that gets you up and running in a matter of minutes. It is perfect for kickstarting the provisioning files for chef-solo environments, such as the virtualized ones Docker and Vagrant provide. At the moment gordon cannot communicate with a Chef server; it only generates static files.


Read more from:

http://nikolas.ninja/gordon-simple-chef-workflow/


Published by:

xm delete domain is not halted

Linux

For years I have had a problem with a few stale Xen virtual servers. After some crashes, I have been unable to delete or stop the domains. When I tried to run xm delete, I got this error message:

root@shire:/var# xm delete domain.phz.fi
 Error: Domain is not halted.
 Usage: xm delete
Remove a domain from Xend domain management.

However, the domain is not running, and xm reboot, xm restore and xm shutdown don't work on it either.

root@shire:/var# xm list
Name             ID   Mem VCPUs      State   Time(s)
Domain-0          0  3354     4     r-----    3485.8
domain.phz.fi         256     1              11054.4
domain2.phz.fi        368     1                  0.0

The only way I’ve been able to start the domain again was to create a second copy of the config file, named domain2.phz.fi.

Googling “xm delete domain is not halted” has not been helpful, since the top results (like this and this) don’t really work.

But now I finally found out how to fix the issue: xend stores the domain run-time data in /var/lib/xend/domains. For example, I found two saved states of the domain that refused to halt:

root@shire:/var# grep -r domain.phz.fi /var/lib/xend/domains/ 
/var/lib/xend/domains/1fada935-a352-daac-7142-d48ffc3dfc51/config.sxp: (name_label domain.phz.fi)
/var/lib/xend/domains/1fada935-a352-daac-7142-d48ffc3dfc51/config.sxp: (name domain.phz.fi)
/var/lib/xend/domains/1fada935-a352-daac-7142-d48ffc3dfc51/config.sxp: (uname file:/home/xen/domains/domain.phz.fi/disk.img)
/var/lib/xend/domains/1fada935-a352-daac-7142-d48ffc3dfc51/config.sxp: (uname file:/home/xen/domains/domain.phz.fi/swap.img)
Binary file /var/lib/xend/domains/1fada935-a352-daac-7142-d48ffc3dfc51/checkpoint.chk matches
/var/lib/xend/domains/9f174da8-26cc-660b-0793-f493aac65b7f/config.sxp: (uname file:/home/xen/domains/domain.phz.fi/disk.img)
/var/lib/xend/domains/9f174da8-26cc-660b-0793-f493aac65b7f/config.sxp: (uname file:/home/xen/domains/domain.phz.fi/swap.img)

Now you can just

rm -rf /var/lib/xend/domains/9f174da8-26cc-660b-0793-f493aac65b7f

After which

xm delete domain.phz.fi

works again.
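Before removing anything, it helps to list every xend state directory that mentions the domain. A minimal sketch, assuming the /var/lib/xend/domains/&lt;uuid&gt;/config.sxp layout shown above; find_xend_state is a hypothetical helper name. Unlike the grep -r above, it only looks at config.sxp (not checkpoint.chk), and grep -F matches the name literally, so the dots in domain.phz.fi are not regex wildcards. In the case above only one of the two matching directories was removed, so review each hit by hand:

```shell
#!/bin/bash
# find_xend_state DOMAIN [STATE_DIR]  (hypothetical helper)
# Print each xend state directory whose config.sxp mentions DOMAIN literally.
# STATE_DIR defaults to /var/lib/xend/domains. Nothing is deleted.
find_xend_state() {
    local domain=$1 dir=${2:-/var/lib/xend/domains} f
    for f in "$dir"/*/config.sxp; do
        [ -f "$f" ] || continue            # the glob matched nothing
        if grep -qF -- "$domain" "$f"; then
            dirname "$f"
        fi
    done
}
```

For example, find_xend_state domain.phz.fi prints the candidate directories, one per line.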


Error storing directory block information Memory Allocation failed

Linux

Unfortunately my large disk partition was corrupted (possibly because it was 100% full for a few months while quite a few processes kept writing 0-byte files to it). The mount fails with this error message:

root@server:/mnt$ mount /dev/md127 md/
mount: wrong fs type, bad option, bad superblock on /dev/md127,
       missing codepage or helper program, or other error
       In some cases useful info is found in syslog - try
       dmesg | tail  or so

Then I found this link suggesting running

e2fsck -y -f -v -C 0 /dev/sda3

But unfortunately it fails with the following error message:

Error storing directory block information (inode=14109880, block=0, num=471166008): Memory allocation failed

Solution

debugfs -w -R "clri <14109880>" /dev/vg/root

From: http://www.spinics.net/lists/linux-ext4/msg42703.html (use your own device in place of /dev/vg/root).

Then rerun the e2fsck from above.

UPDATE: I wrote this script to automate the clear-and-recheck cycle, since doing it manually might take weeks:

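#!/bin/bash
# Repeatedly run e2fsck; whenever it reports
# "Error storing directory block information (inode=N, ...)",
# clear inode N with debugfs and check again.

# TODO: first run "mkfs.ext3 -n $DEVICE" (dry run, writes nothing) to find
# the backup superblock locations, and put one of them here.
SUPERBLOCK=11239424
DEVICE=/dev/md127

# Clear (zero) the given inode with debugfs in read-write mode.
clear_inode() {
    local inode=$1
    debugfs -w -R "clri <${inode}>" "$DEVICE"
}

# Run a forced e2fsck from the backup superblock and print the inode number
# from the first line like
# "Error storing directory block information (inode=14109880, block=0, num=471166008): Memory allocation failed"
# Prints an empty line if e2fsck did not report that error.
check_inode() {
    local out
    # sed reads all of e2fsck's output, so e2fsck always runs to completion.
    out=$(e2fsck -b "$SUPERBLOCK" -y -f -v -C 0 "$DEVICE" 2>&1 |
          sed -n 's/.*Error storing directory block information (inode=\([0-9]*\),.*/\1/p')
    # Keep only the first reported inode.
    head -n 1 <<< "$out"
}

# Succeed only if the argument is a non-empty string of digits.
is_int() {
    [[ $1 =~ ^[0-9]+$ ]]
}

INODE=$(check_inode)
echo "Checking $INODE"
OLDINODE=$INODE
while is_int "$INODE"; do
    echo "Clearing inode $INODE"
    clear_inode "$INODE"
    INODE=$(check_inode)
    echo "New inode $INODE"
    # Stop if clearing did not help and e2fsck reports the same inode again.
    if [ "$INODE" = "$OLDINODE" ]; then
        echo "ERROR Looping the same inode"
        exit 2
    fi
    OLDINODE=$INODE
done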
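The TODO above relies on mkfs.ext3 -n, which only prints what mkfs would do (including a "Superblock backups stored on blocks:" list) without writing anything; the listed locations are only meaningful if the same parameters (such as block size) are used as when the filesystem was created. A small sketch for extracting those numbers, assuming the usual mke2fs output layout where the list follows that line and ends at the next blank line; list_backup_superblocks is a hypothetical name:

```shell
#!/bin/bash
# list_backup_superblocks  (hypothetical helper)
# Read "mkfs.ext3 -n DEVICE" output on stdin and print the backup superblock
# numbers, one per line.
list_backup_superblocks() {
    # awk: print only the lines between the "Superblock backups" header and the
    # next blank line. tr: turn every run of non-digits into a single newline.
    # grep: drop the empty line left by the leading tab.
    awk '/Superblock backups stored on blocks:/ { in_list = 1; next }
         in_list && NF == 0                      { in_list = 0 }
         in_list' |
        tr -cs '0-9' '\n' |
        grep -E '^[0-9]+$'
}
```

Usage: mkfs.ext3 -n /dev/md127 | list_backup_superblocks, then pick one of the printed numbers for SUPERBLOCK.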



GRUB loading stage 1.5 GRUB loading, please wait

Hardware Linux Open Source

After an unusually long power outage (more than an hour, longer than our UPSes could handle), our company's main file server running Ubuntu 10.04 Lucid LTS has been down for two weeks in a row. There have been multiple issues with broken HDDs, corrupted RAID arrays etc., and on top of that the computer doesn’t boot normally (or at all). After fixing some other boot problems, I got stuck at this GRUB error message:

GRUB loading stage 1.5
GRUB loading, please wait…

Note that this particular message displays no error code.

GRUB seems to have plenty of different error messages, but the Gentoo Wiki has a good summary of them all. The same problem has tormented other distros such as Red Hat and PCLinuxOS, too.

I tried out the usual tricks, such as running grub-install and update-grub, and then, in the grub shell,

root (hd0,0)
setup (hd0)

but nothing seemed to work. Finally I started to rip the 10+ hard disk cables out one by one, and after taking out all the SATA drives I was left with only one PATA/IDE drive. Note that the server is actually a dual-Pentium III on an Abit VP6 motherboard with an old BIOS that can only detect small PATA drives; the 1.5-2 TB drives are way too big for the old 32-bit BIOS to manage 🙂

Suddenly when there was only one drive left, the grub menu appeared and the error went away!

It seems that the problem was caused by the SiI3512A SATA RAID controller card. The system had three SATA controller expansion cards: two 4-port Promise cards and one 2-port SiI3512A card. After unplugging the cables from the SiI3512A controller, the system no longer tried to boot from the SATA disks (which the BIOS can’t handle) but booted from the old PATA drive instead.

Another option could have been to wipe GRUB from the two SATA disks connected to the SiI3512A, but since there were free ports available on the Promise cards, I just removed the semi-working card. It was a miracle that the BIOS had detected GRUB on the SATA disks at all, but in the end it just caused a problem that was very difficult to fix, so I’d rather have had a non-booting SATA controller card instead.

Anyway, at least the boot problem was solved.


ipa Could not initialize GSSAPI: Unspecified GSS failure. Minor code may provide more information No credentials cache found

Company Linux

While setting up FreeIPA I got this error message:

[root@ldap]# ipa-pwpolicy --show
Could not initialize GSSAPI: ('Unspecified GSS Failure. Minor code may provide more information', 851968)/('No credentials cache found', -1765328189)

The solution was not spelled out directly, but the FreeIPA documentation says that you should first log in to Kerberos:

kinit admin

then the following commands should work:

ldapsearch -Y GSSAPI -b "dc=phz,dc=fi" uid=admin

ipa-pwpolicy --show
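To avoid hitting the "No credentials cache found" error in scripts, one option is to request a ticket only when there is no valid one yet. A minimal sketch, assuming MIT Kerberos, where klist -s prints nothing and exits non-zero when the credentials cache holds no valid ticket; ensure_kerberos_ticket is a hypothetical helper name:

```shell
#!/bin/bash
# ensure_kerberos_ticket [PRINCIPAL]  (hypothetical helper)
# Run kinit for PRINCIPAL (default: admin) unless the current credentials
# cache already holds a valid ticket.
ensure_kerberos_ticket() {
    local principal=${1:-admin}
    if ! klist -s; then
        kinit "$principal"
    fi
}
```

Usage: ensure_kerberos_ticket admin && ipa-pwpolicy --show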


How to change CD ISO in Xen

Company Linux

I was setting up some virtual machines using Xen and virt-manager, but got stuck because I couldn’t change CDs when the installer asked for CD number 2.

After some googling and a few dead ends I finally found an approach that worked. First, find out the ID of your virtual machine:

root@cardolan:/var/lib/libvirt# xm list
Name                                 ID   Mem VCPUs      State   Time(s)
Domain-0                              0  4886     4     r-----  402051.2
mosaic                                    800     1                123.6
test-i386-redhat-7.2-template        27   800     1     -b----     103.6
test-i386-ubuntu-8.04-firefox-1.0    24   800     1     --p---     132.8
test-ubuntu8.04-mozilla19990128           800     1                115.8
ubuntu804-template                        800     1                  7.8
winxp                                    1024     1              18674.4

The one we are setting up has ID 27. Next, check which disks the domain is using:

root@cardolan:/var/lib/libvirt# xm block-list -l 27
(768
((backend-id 0)
(virtual-device 768)
(device-type disk)
(state 1)
(backend /local/domain/0/backend/vbd/27/768)
)
)
(832
((backend-id 0)
(virtual-device 832)
(device-type cdrom)
(state 1)
(backend /local/domain/0/backend/vbd/27/832)
(eject eject)
)
)

We are interested in the device whose device-type is cdrom. Copy its backend path, i.e. /local/domain/0/backend/vbd/27/832, and append /params to it to see the currently mounted ISO:

root@cardolan:/var/lib/libvirt# xenstore-read /local/domain/0/backend/vbd/27/832/params
/home/downloads/seawolf-i386-disc2.iso

Change the ISO:

xenstore-write /local/domain/0/backend/vbd/27/832/params /home/downloads/seawolf-i386-disc2.iso

and voilà, you can continue the installation!
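The lookup above can be scripted. A sketch, assuming the xm block-list -l layout shown above, where (device-type ...) appears before (backend ...) inside each device entry; find_cdrom_backend and change_iso are hypothetical helper names, and only the parsing function can be tried without a Xen host:

```shell
#!/bin/bash
# find_cdrom_backend  (hypothetical helper)
# Read "xm block-list -l DOMID" output on stdin and print the xenstore backend
# path of every device whose device-type is cdrom.
find_cdrom_backend() {
    awk '
        /^\([0-9]+[[:space:]]*$/ { is_cdrom = 0 }   # start of a new device entry
        /\(device-type cdrom\)/  { is_cdrom = 1 }
        is_cdrom && /\(backend / {
            path = $2
            sub(/\).*$/, "", path)                  # drop the closing parenthesis
            print path
            is_cdrom = 0
        }
    '
}

# change_iso DOMID ISO  (hypothetical helper)
# Show the ISO currently attached to the first cdrom of DOMID, then swap it.
change_iso() {
    local domid=$1 iso=$2 backend
    backend=$(xm block-list -l "$domid" | find_cdrom_backend | head -n 1)
    if [ -z "$backend" ]; then
        echo "no cdrom found for domain $domid" >&2
        return 1
    fi
    echo "current: $(xenstore-read "$backend/params")"
    xenstore-write "$backend/params" "$iso"
}
```

Usage for the domain above: change_iso 27 /home/downloads/seawolf-i386-disc2.iso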


Failed to read last sector (2930272255): Invalid argument

Linux

Today I tried to fix a broken NTFS hard disk. I was able to read parts of the disk by running

dd if=/dev/sdc1 of=rescue.dd.img conv=noerror,sync

When the disk started to give too many errors, I skipped some gigabytes and continued reading further on (note that by default dd reads 512 bytes at a time, so skip and seek count 512-byte blocks; skip=10000000 jumps about 5.1 GB):

dd if=/dev/sdc1 of=rescue.dd.img skip=10000000 seek=10000000 conv=noerror,sync

To make rescue.dd.img the same size as the original (adding about 1.3 TB), I ran truncate, which is instant, unlike dd:

truncate -s +1329705307136 rescue.dd.img
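The resume-and-pad steps above, sketched as two hypothetical helpers (copy_from_block, pad_to_size), assuming GNU dd and coreutils truncate. Using the same value for skip and seek keeps every block at its original offset in the image; conv=notrunc, which the commands above do not use, additionally keeps dd from cutting the image off at the seek point if an earlier region is re-read later:

```shell
#!/bin/bash
# copy_from_block SRC IMG START_BLOCK [COUNT]  (hypothetical helper)
# Copy SRC into IMG starting at 512-byte block START_BLOCK, writing to the same
# offset in IMG (skip == seek) so the image stays aligned with the source.
# conv=noerror,sync: keep going after read errors and pad failed blocks with
# zeros; conv=notrunc: keep whatever IMG already contains after this point.
copy_from_block() {
    local src=$1 img=$2 start=$3 count=${4:-}
    dd if="$src" of="$img" bs=512 skip="$start" seek="$start" \
        ${count:+count="$count"} conv=noerror,sync,notrunc
}

# pad_to_size IMG SIZE_BYTES  (hypothetical helper)
# Set IMG to exactly SIZE_BYTES. truncate only changes the file length and
# leaves a sparse hole, which is why it is instant even for terabytes.
pad_to_size() {
    truncate -s "$2" "$1"
}
```

For example, copy_from_block /dev/sdc1 rescue.dd.img 10000000 followed by pad_to_size rescue.dd.img "$(blockdev --getsize64 /dev/sdc1)" sets the image to the exact partition size instead of adding a precomputed amount.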

Then I ran

ntfsfix -f rescue.dd.img

but mounting the image still failed:
> mount rescue.dd.img /media/windows
Failed to read last sector (2930272255): Invalid argument
HINTS: Either the volume is a RAID/LVM but it wasn't set up yet,
or it was not setup correctly (e.g. by not using mdadm --build ...).
or a wrong device is tried to be mounted,
or the partition table is corrupt (partition is smaller than NTFS),
or the NTFS boot sector is corrupt (NTFS size is not valid).
Failed to mount '/dev/loop0': Invalid argument
Maybe the wrong device is used? Or the whole disk instead of a
partition (e.g. /dev/sda, not /dev/sda1)? Or the other way around?

Thanks to this article I managed to mount the image by forcing ntfs-3g:

sudo ntfs-3g -o force,rw rescue.dd.img /media/windows

Success!


Ubuntu X Server does not start

Linux

I upgraded my Ubuntu to 11.04, but unfortunately X no longer started. The login screen came up, but after logging in the X server seemed to crash and return to the login screen. After upgrading to kernel 3.0.0.13 or higher I was also unable to switch to console mode with Ctrl-Alt-F7 / F6 due to an unsupported screen resolution (this could be fixed by uncommenting #GRUB_TERMINAL=console in /etc/default/grub and then running sudo update-grub). I tried a bunch of different tricks, such as the ones described here, without success. I went through the errors in the X log with

grep "(EE)" /var/log/Xorg.0.log

and tried to fix, for example, the issues with /dev/fb0 not being found and the nouveau and nv drivers missing. However, my other, well-functioning machine had the same error messages, so they were not the cause of the problem.
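The console workaround mentioned above (uncommenting #GRUB_TERMINAL=console in /etc/default/grub) can also be done non-interactively. A sketch assuming a sed that accepts -i.bak (GNU and BSD sed both do); enable_console_terminal is a hypothetical helper name, and sudo update-grub still has to be run afterwards:

```shell
#!/bin/bash
# enable_console_terminal [FILE]  (hypothetical helper)
# Uncomment the "#GRUB_TERMINAL=console" line in a GRUB defaults file
# (default: /etc/default/grub), keeping the original as FILE.bak.
enable_console_terminal() {
    local file=${1:-/etc/default/grub}
    sed -i.bak 's/^#[[:space:]]*GRUB_TERMINAL=console/GRUB_TERMINAL=console/' "$file"
}
```

Usage: sudo bash -c '. ./helpers.sh && enable_console_terminal && update-grub', where helpers.sh is a hypothetical file containing the function.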

Finally I also took a look at the other log files and found the problem:

Mar 27 23:29:29 server pulseaudio[1789]: [autospawn] core-util.c: Failed to create random directory /tmp/pulse-oLhPPjki4YeE: Permission denied
Mar 27 23:29:29 server pulseaudio[1789]: [autospawn] core-util.c: Failed to symlink /var/lib/lightdm/.pulse/444ae23e9ad59364ceaf30c200000006-runtime.tmp: Permission denied

The /tmp directory had too strict permissions! This was quickly fixed with:

sudo chmod a+w /tmp
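For reference, the usual mode of /tmp is 1777: writable by everyone, with the sticky bit set so that users can only delete their own files. A small check, assuming GNU stat (-c %a prints the octal mode, including the sticky bit); check_sticky_tmp is a hypothetical helper name:

```shell
#!/bin/bash
# check_sticky_tmp [DIR]  (hypothetical helper)
# Succeed if DIR (default: /tmp) has mode 1777; otherwise print the current
# mode to stderr and fail.
check_sticky_tmp() {
    local dir=${1:-/tmp} mode
    mode=$(stat -c %a "$dir") || return 2
    if [ "$mode" = "1777" ]; then
        return 0
    fi
    echo "$dir has mode $mode, expected 1777 (fix: sudo chmod 1777 $dir)" >&2
    return 1
}
```

Usage: check_sticky_tmp || sudo chmod 1777 /tmp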


Jenkins git remote slave problem

Linux Software Engineering

Today I ran into an issue with a Jenkins Continuous Integration server (v1.412) while trying to fetch a git repository on a remote slave server (Red Hat). I found a few similar issues and open bugs, but none of them were directly related to Linux slaves. I got this error message dump:

Started by user jenkins
Building remotely on server01
Checkout:MyProject / /home/jenkins/workspace/MyProject - hudson.remoting.Channel@5352c503:server01
Using strategy: Default
Last Built Revision: Revision ebb0c40a1a321a00d8176e25aa81364efaac702f (origin/master)
Checkout:MyProject / /home/jenkins/workspace/MyProject - hudson.remoting.LocalChannel@59ab12f8
Fetching changes from 1 remote Git repository
Fetching upstream changes from ssh://git.server/var/repos/git/MyProject
ERROR: Problem fetching from origin / origin - could be unavailable. Continuing anyway
ERROR: (Underlying report) : Error performing command: git fetch -t ssh://git.server/var/repos/git/MyProject +refs/heads/*:refs/remotes/origin/*
Command "git fetch -t ssh://git.server/var/repos/git/MyProject +refs/heads/*:refs/remotes/origin/*" returned status code 128: error: cannot run ssh: No such file or directory
fatal: unable to fork

ERROR: Could not fetch from any repository
FATAL: Could not fetch from any repository
hudson.plugins.git.GitException: Could not fetch from any repository
at hudson.plugins.git.GitSCM$2.invoke(GitSCM.java:1008)
at hudson.plugins.git.GitSCM$2.invoke(GitSCM.java:968)
at hudson.FilePath$FileCallableWrapper.call(FilePath.java:1956)
at hudson.remoting.UserRequest.perform(UserRequest.java:118)
at hudson.remoting.UserRequest.perform(UserRequest.java:48)
at hudson.remoting.Request$2.run(Request.java:270)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at hudson.remoting.Engine$1$1.run(Engine.java:60)
at java.lang.Thread.run(Thread.java:636)

The issue was resolved by adding the correct location of the ssh command (/usr/bin on my server) to the PATH in the slave node’s environment settings in Jenkins.
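To check in advance whether a given PATH value (for example the one configured for the slave node in Jenkins) lets git find ssh, the lookup can be reproduced in a subshell. A minimal sketch; find_in_path is a hypothetical helper name, and running command -v ssh directly on the slave shows where ssh actually lives:

```shell
#!/bin/bash
# find_in_path PATHVALUE CMD  (hypothetical helper)
# Print the full path CMD resolves to when PATH is set to PATHVALUE, or fail
# if CMD cannot be found. The subshell keeps the caller's PATH unchanged.
find_in_path() {
    ( PATH=$1; command -v "$2" )
}
```

Usage: find_in_path "/usr/local/bin:/bin" ssh || echo "ssh not found with this PATH"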
