Blog from July, 2020

An eventful day. Besides upgrading Jira and Confluence, I also do upgrades of the underlying Ubuntu operating system. Today I had 8 AWS EC2 instances to upgrade, all running Ubuntu 16.04.6 LTS.

I did the usual steps, upgrading the sandboxes first:

apt-get update
apt-get upgrade
reboot

3 of the 4 sandboxes failed to boot! They were all stuck on the Grub rescue screen, showing:

Booting from Hard Disk 0...
error: symbol `grub_calloc' not found. Entering rescue mode...
grub rescue> _



Googling yielded this askubuntu.com post, which provides a general way forward: we need to reinstall grub with grub-install <disk> .

How to reinstall grub on AWS

The joyous thing about AWS is you only get a screenshot of the grub rescue> prompt. You can't actually rescue anything. (Edit: this wasn't a MacGyver-rescue'able situation anyway.)

For AWS the process is:

  1. Launch a recovery t2.micro in the same AZ/subnet as your borked instance(s).
  2. Stop the broken instance and detach the root volume (the one containing the OS).
  3. Attach the root volume to the recovery instance as /dev/sdf:



  4. Now run these commands:

    mount /dev/xvdf1 /mnt
    for fs in {proc,sys,tmp,dev}; do mount -o bind /$fs /mnt/$fs; done
    chroot /mnt
    lsblk
    grub-install /dev/xvdf

    It looked like this:

  5. Then:

    exit
    for fs in {proc,sys,tmp,dev}; do umount /mnt/$fs; done
    umount /mnt

    (Edit: added fs umounts per this LP comment)

  6. Detach the volume from the recovery instance
  7. Attach the volume back on the original instance.
  8. Reboot original instance

Then you're golden.
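
The volume shuffling in steps 2, 3 and 6 to 8 can also be done from the AWS CLI instead of the console. What follows is a minimal sketch, not something I ran during the incident: the function names are made up, the IDs you pass in are your own, and the /dev/sda1 default for the original root device is an assumption you should check against the instance's "Root device" field. With DRY_RUN=1 (the default) it only prints the commands.

```shell
#!/bin/sh
# Sketch only. DRY_RUN=1 (the default) prints each AWS CLI call instead of running it.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Steps 2-3: stop the broken instance and attach its root volume to the recovery box.
move_to_recovery() {
    broken="$1"; volume="$2"; recovery="$3"
    run aws ec2 stop-instances --instance-ids "$broken"
    run aws ec2 wait instance-stopped --instance-ids "$broken"
    run aws ec2 detach-volume --volume-id "$volume"
    run aws ec2 wait volume-available --volume-ids "$volume"
    run aws ec2 attach-volume --volume-id "$volume" --instance-id "$recovery" --device /dev/sdf
}

# Steps 6-8: move the repaired volume back and start the original instance.
# root_device defaults to /dev/sda1, which is an ASSUMPTION; verify it first.
move_back() {
    broken="$1"; volume="$2"; root_device="${3:-/dev/sda1}"
    run aws ec2 detach-volume --volume-id "$volume"
    run aws ec2 wait volume-available --volume-ids "$volume"
    run aws ec2 attach-volume --volume-id "$volume" --instance-id "$broken" --device "$root_device"
    run aws ec2 wait volume-in-use --volume-ids "$volume"
    run aws ec2 start-instances --instance-ids "$broken"
}
```

Call move_to_recovery with the broken instance ID, the root volume ID and the recovery instance ID, do the chroot and grub-install by hand, unmount, then call move_back. Set DRY_RUN=0 only once the printed commands look right.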

What caused the problem?

(Edit: rewritten 31/July based on lp#1889509 comments)

On the 3 servers that broke, apt-get upgrade did not prompt me for anything grub-related. Checking afterwards with debconf-get-selections, I see that debconf was pre-configured to install grub on one device:

grub-pc       grub-pc/install_devices_disks_changed   multiselect     /dev/disk/by-id/nvme-Amazon_Elastic_Block_Store_vol01e713272eb256s52-part1

That device symlink is wrong: it names a partition (note the -part1 suffix), not the whole disk. It was either pre-seeded by cloud-init or by the AMI creator (I'm not sure). My servers need grub on /dev/xvdf, not /dev/xvdf1. The grub postinst script would have encountered the same failure I saw while recovering:

# grub-install /dev/xvdf1
grub-install: warning: File system `ext2' doesn't support embedding.
grub-install: warning: Embedding is not possible.  GRUB can only be installed in this setup by using blocklists.  However, blocklists are UNRELIABLE and their use is discouraged..
grub-install: error: will not proceed with blocklists.

I must not have noticed this error in the wall of text scrolling past on the upgrade.
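
One way to avoid missing it next time is to search for the error after the upgrade and before the reboot. A minimal sketch, assuming the postinst output was captured in /var/log/apt/term.log (where apt on Debian/Ubuntu records dpkg's terminal output); the function name is made up:

```shell
# Print (with line numbers) any grub-install errors recorded in an apt terminal log.
# Exit status is 0 if at least one error line was found, non-zero otherwise.
grub_install_errors() {
    log="${1:-/var/log/apt/term.log}"
    grep -n 'grub-install: error' "$log"
}
```

Used as `if grub_install_errors; then echo "do not reboot yet"; fi`, it turns the wall of text into a yes/no answer. It only sees what is still in the current log file, so rotated logs need checking separately.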

Grub should have failed hard, but didn't (this has now been filed as https://bugs.launchpad.net/ubuntu/+source/grub2/+bug/1889556/ ).

The result: grub is upgraded in my /boot partition, but the grub loader on /dev/xvdf is still old, and the mismatch causes the failure (hat tip to ~juliank on lp#1889509).


To reinforce this theory, we turn to the 1 of the 4 sandboxes that did not break. When I did apt-get upgrade, the surviving server had previously demanded I give it some 'GRUB install devices':

I had decided to take debconf's advice, and installed grub on all devices:

and that saved me, at least on this one server.


But why did this update go wrong?

I upgrade OS packages on these servers every month, but looking at my /var/log/dpkg.log history, grub is very rarely updated. Before today, the last update was in December 2019, nearly eight months ago:

root@usw-jira01:/var/log# zgrep 'upgrade grub-pc:amd64' dpkg.log*
dpkg.log:2020-07-30 04:06:23 upgrade grub-pc:amd64 2.02~beta2-36ubuntu3.23 2.02~beta2-36ubuntu3.26
dpkg.log.12.gz:2019-07-15 02:04:55 upgrade grub-pc:amd64 2.02~beta2-36ubuntu3.20 2.02~beta2-36ubuntu3.22
dpkg.log.7.gz:2019-12-09 03:06:58 upgrade grub-pc:amd64 2.02~beta2-36ubuntu3.22 2.02~beta2-36ubuntu3.23
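
zgrep lists matches in file-name order, so the dates above are not chronological. A small sketch for putting them in order, assuming the dpkg.log line layout shown above (date, time, action, package, old version, new version); the function name is made up:

```shell
# Read zgrep output on stdin (with or without a "dpkg.log...:" file-name prefix)
# and print upgrades oldest-first as "date old-version -> new-version".
grub_upgrade_history() {
    sed 's/^[^ ]*dpkg\.log[^: ]*://' \
        | sort \
        | awk '$3 == "upgrade" { print $1, $5, "->", $6 }'
}
```

For example: `zgrep 'upgrade grub-pc:amd64' /var/log/dpkg.log* | grub_upgrade_history`. Sorting works because the timestamps are ISO-style, so plain text order is also date order.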


So I posit that there is nothing directly wrong with this grub update specifically, but rather that I, and a lot of other people, are hitting a problem in the general Debian grub update process. Specifically, when grub fails to update with an error:

  
grub-install: warning: File system `ext2' doesn't support embedding.
grub-install: warning: Embedding is not possible.  GRUB can only be installed in this setup by using blocklists.  However, blocklists are UNRELIABLE and their use is discouraged..
grub-install: error: will not proceed with blocklists.


it should fail hard, but instead proceeds.

How do I know if my server will break?

If your system boots with UEFI, you're fine.

If you're on Linode with default settings, you're fine.

For BIOS users, including AWS and other VPS hosters, run the following:

cd /tmp
curl -LOJ https://gist.github.com/jefft/76cf6c5f6605eee55df6079223d8ba1c/raw/bf0985bdb1e2ef5fc74e2aee7ebf29c4eaf7199f/grubvalidator.sh
chmod +x grubvalidator.sh
./grubvalidator.sh

This script checks if your grub version includes a fix for lp #1889556 and if not, checks if you are likely to experience boot problems.


I just upgraded an Ubuntu 20.04 laptop without drama, but laptops all use EFI, not BIOS. So far I've only seen this affect BIOS users.

Edit: that is correct. Per  ~juliank on lp #1889509:

This is not a problem on UEFI systems fwiw, as they do not use a small image in MBR and load the rest from /boot, but a single monolithic grub image in the ESP.
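
If you are not sure which camp a server is in, the usual check is whether the kernel exposes /sys/firmware/efi, which only exists after a UEFI boot. A minimal sketch; the function name and its optional path argument are made up for illustration:

```shell
# Report whether the running system booted via UEFI or legacy BIOS.
# The optional argument only exists so the check can be pointed at another path.
boot_mode() {
    efi_dir="${1:-/sys/firmware/efi}"
    if [ -d "$efi_dir" ]; then
        echo UEFI
    else
        echo BIOS
    fi
}
```

Running `boot_mode` with no argument on the server itself answers the question; anything that prints BIOS is worth checking further.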

So here's my litmus test. Run:

sudo apt-get install debconf-utils
sudo debconf-get-selections | awk '$1=="grub-pc" && $2 == "grub-pc/install_devices" {print $4}'

If you see nothing, you're fine. Either:

  • you're on EFI and unaffected
  • you're on BIOS, but will be forcefully prompted to pick a device, as I was.

If you get a single device back:

sudo debconf-get-selections | awk '$1=="grub-pc" && $2 == "grub-pc/install_devices" {print $4}'
/dev/disk/by-id/nvme-Amazon_Elastic_Block_Store_vol01e713272eb256s52-part1

then you may be in trouble (not always - see below re. Linode). You'd best run dpkg-reconfigure grub-pc before rebooting, and install Grub on some other devices to be safe.
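
To make the "may be in trouble" call a little less manual, the device names can be classified by how they look. This is a rough sketch that goes on naming conventions only (the -partN suffix under /dev/disk/by-id, a trailing digit on names like /dev/xvdf1, pN on NVMe); the function names are made up, and devices such as /dev/loop0 or /dev/md0 would be misread as partitions. A "disk" answer is also no guarantee, as the Linode case shows.

```shell
# Guess from the name alone whether a device path is a whole disk or a partition.
classify_device() {
    case "$1" in
        /dev/disk/by-id/*-part[0-9]*) echo partition ;;
        /dev/nvme*n*p[0-9]*|/dev/mmcblk*p[0-9]*) echo partition ;;
        /dev/nvme*|/dev/mmcblk*|/dev/disk/by-id/*) echo disk ;;
        /dev/*[0-9]) echo partition ;;
        /dev/*) echo disk ;;
        *) echo unknown ;;
    esac
}

# Split a debconf value (comma and/or space separated) and classify each entry.
classify_selection() {
    printf '%s\n' "$1" | tr ', ' '\n\n' | while IFS= read -r dev; do
        [ -n "$dev" ] || continue
        printf '%s %s\n' "$(classify_device "$dev")" "$dev"
    done
}
```

Feed it the output of the awk command above, for example `classify_selection "$(sudo debconf-get-selections | awk '$1=="grub-pc" && $2 == "grub-pc/install_devices" {print $4}')"`. Any line starting with "partition" is the pattern that broke my servers.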

Linode

~mnordhoff on lp #1889509 says:

On Linode, I believe you should be safe unless the kernel is set to "Direct Disk". It doesn't matter if the GRUB installation works if it's not being used.

This is correct in my experience. My own servers are on Linode, where the OS disk is /dev/sda and the above command prints:

root@radish-linode:~# sudo debconf-get-selections | awk '$1=="grub-pc" && $2 == "grub-pc/install_devices" {print $4}'
/dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_drive-scsi-disk-0,

When I run dpkg-reconfigure grub-pc  and pick /dev/sda  I get a scary warning, replicable from the command-line:

root@radish-linode:~# grub-install /dev/sda
Installing for i386-pc platform.
grub-install: warning: File system `ext2' doesn't support embedding.
grub-install: warning: Embedding is not possible.  GRUB can only be installed in this setup by using blocklists.  However, blocklists are UNRELIABLE and their use is discouraged..
grub-install: error: will not proceed with blocklists.

However, despite all the warning signs, my VM rebooted fine.


Edit: I filed a launchpad bug: https://bugs.launchpad.net/ubuntu/+source/grub2/+bug/1889509

Edit: The essential problem is that the grub-pc  package doesn't fail in the presence of bad debconf data. That has been fixed per https://bugs.launchpad.net/ubuntu/+source/grub2/+bug/1889556.

Edit2: Updated 'What caused the problem' section with info from comments on the bug.

This page constitutes random notes from my work day as an Atlassian product consultant, put up in the vague hope they might benefit others. Expect rambling, reference to unsolved problems, and plenty of stacktraces. Check the date as any information given is likely to be stale.

A surprising log I noticed on JIRA startup today:

2020-07-11 14:18:29,465+1000 JIRA-Bootstrap ERROR      [c.k.j.p.keplercf.admin.KCFLauncher] Cannot copy the JSP. Error was:java.io.FileNotFoundException: /opt/atlassian/redradish_jira/8.10.0/atlassian-jira/secure/popups/ksil_userpicker.jsp (Permission denied)
java.io.FileNotFoundException: /opt/atlassian/redradish_jira/8.10.0/atlassian-jira/secure/popups/ksil_userpicker.jsp (Permission denied)
        at java.base/java.io.FileOutputStream.open0(Native Method)
        at java.base/java.io.FileOutputStream.open(FileOutputStream.java:298)
        at java.base/java.io.FileOutputStream.<init>(FileOutputStream.java:237)
        at java.base/java.io.FileOutputStream.<init>(FileOutputStream.java:187)
        at com.keplerrominfo.jira.plugins.keplercf.admin.KCFLauncher.copyJSPFile(KCFLauncher.java:346)
        at com.keplerrominfo.jira.plugins.keplercf.admin.KCFLauncher.launch(KCFLauncher.java:188)
        at com.keplerrominfo.refapp.launcher.AbstractDependentPluginLauncher.tryToLaunch(AbstractDependentPluginLauncher.java:139)
        at com.keplerrominfo.refapp.launcher.AbstractDependentPluginLauncher.handleEvent(AbstractDependentPluginLauncher.java:85)
        at com.keplerrominfo.jira.plugins.keplercf.admin.KCFLauncher.onPluginEvent(KCFLauncher.java:215)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at com.atlassian.event.internal.SingleParameterMethodListenerInvoker.invoke(SingleParameterMethodListenerInvoker.java:42)
        at com.atlassian.event.internal.AsynchronousAbleEventDispatcher.lambda$null$0(AsynchronousAbleEventDispatcher.java:37)
        at com.atlassian.event.internal.AsynchronousAbleEventDispatcher.dispatch(AsynchronousAbleEventDispatcher.java:85)
        at com.atlassian.event.internal.LockFreeEventPublisher$Publisher.dispatch(LockFreeEventPublisher.java:220)
        at com.atlassian.event.internal.LockFreeEventPublisher.publish(LockFreeEventPublisher.java:96)
...
2020-07-11 12:26:36,465+1000 UpmAsynchronousTaskManager:thread-2 ERROR jturner 736x198x1 1uhj4pl 127.0.0.1 /rest/plugins/1.0/updates/all [c.k.j.p.keplercf.admin.KCFLauncher] You must manually copy the ksil_userpicker.jsp file into the correct directory (read the manual). Destination path: JIRA-HOME/atlassian-jira/secure/popups/ksil_userpicker.jsp


It appears this is caused by the SIL Engine plugin:

SIL Engine is attempting to copy a JSP to JIRA's app directory, and failing due to permissions.

SIL Engine is a "library plugin", a dependency of other CPrime plugins, which I had experimented with in the past (Power Custom Fields, I think). The problem with "library plugins" is that they hang around even after the last plugin that used them is uninstalled. Thus: SIL Engine on my system.

Digression: Permissions in your /opt/atlassian/jira directory

The attempted copy failed ('Permission Denied'), and rightly so. JIRA (and any webapp) should absolutely not be allowed to write to its own installation directory. Back in 2009 I had not learned this. I was a volunteer administrator of https://jira.apache.org, and left the app directory writable, which contributed to the server being hacked:

Countless PHP hacks have been enabled by not following this rule, due to apps like WordPress encouraging the anti-pattern of allowing the app to upgrade itself.
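
A quick way to audit this is to list what the application's own account can write to under its installation directory. A minimal sketch in plain POSIX shell; the function name is made up, and it tests the permissions of whoever runs it, so it needs to be run as the account Jira runs under (as root it will report everything):

```shell
# List every path under a directory that the *current* user can write to.
# Ideally this prints nothing when run as the webapp's user on its install directory.
writable_paths() {
    find "$1" -print | while IFS= read -r path; do
        if [ -w "$path" ]; then
            printf '%s\n' "$path"
        fi
    done
}
```

Saved into a script and run through sudo -u as the Jira user against the installation directory (for me that is somewhere under /opt/atlassian), any output is a path the webapp could overwrite.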

But what about ksil_userpicker.jsp?

There is nothing on the web about ksil_userpicker.jsp, so I contacted CPrime. Developer Radu Dumitriu replied:

That file was necessary because, at the time, Jira didn't have the ability to let us create a user picker panel (3.x). That's the solution we came with, so obviously it remained unchanged. We will change that when we'll publish a major version of the addon (hopefully).

So the answer to both your questions is 'history'

So it looks innocuous. Still, I think it best to ignore the error until usage actually demands this JSP's presence. If, as for me, SIL Engine is a relic, you would be best off uninstalling it.