Contributors mailing list archives

Re: runbot is down

Acsone SA, Stéphane Bidoul
- 11/10/2016 12:19:44
Thanks a lot for the explanation, and the maintenance work, Alexandre.


On Tue, Oct 11, 2016 at 1:07 PM Alexandre Fayolle wrote:
On 11/10/2016 10:38, Stéphane Bidoul wrote:
> Hi Alexandre,
> "internal postgres corruption" looks pretty scary. Any idea on the root
> cause or lesson we can learn here?

Maybe the lesson is "be careful when running 1500 databases on your
postgresql cluster with lots of simultaneous connections". And the
second lesson is that the cleanup in the runbot is not very
good/efficient/robust (I'm not sure exactly, but it seems to leave a lot
of crap behind).

I had to fight against systemd, which decided that PG was taking too
much time to start up (replaying the WAL / rebuilding some internal data
structures was taking a bit of time), and would issue a kill -9, which
was *not* a helpful way of solving things.
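
One common way to stop systemd from killing a slow-starting service is a
drop-in file that raises TimeoutStartSec. The sketch below only renders
such a drop-in; the unit name (postgresql.service), the file name and the
1800 second value are assumptions for illustration, not what was actually
configured on the runbot.

```python
from pathlib import PurePosixPath


def render_dropin(timeout_seconds: int) -> str:
    """Return the text of a systemd drop-in that raises the start timeout."""
    if timeout_seconds <= 0:
        raise ValueError("timeout_seconds must be positive")
    # TimeoutStartSec is the systemd [Service] directive that bounds startup time.
    return f"[Service]\nTimeoutStartSec={timeout_seconds}\n"


def dropin_path(unit: str, name: str = "startup-timeout.conf") -> PurePosixPath:
    """Return where the drop-in for `unit` would live (file name is hypothetical)."""
    return PurePosixPath("/etc/systemd/system") / f"{unit}.d" / name


if __name__ == "__main__":
    # Assumed unit name; the real one depends on the distribution and PG version.
    print(dropin_path("postgresql.service"))
    print(render_dropin(1800))
```

After writing such a file, systemd needs a `systemctl daemon-reload` before
the new timeout applies.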

In the end I:

* manually cleaned up all the builds
* manually dropped all the databases
* rebooted the servers
* DELETEd the runbot.builds related to the heads of the main branches, so
that the rebuild would work correctly by recreating a build environment
from GitHub
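
The "drop all the databases" step in the list above can be scripted
rather than done by hand. A minimal sketch, assuming build databases can
be recognised by a name starting with a numeric build id followed by a
dash (this pattern is a guess, adjust it to the real naming scheme). It
only generates the SQL statements, so they can be reviewed before being
fed to psql.

```python
import re

# Databases that must never be dropped on a PostgreSQL cluster.
PROTECTED = {"postgres", "template0", "template1"}


def quote_ident(name: str) -> str:
    """Quote a PostgreSQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def drop_statements(db_names, pattern=r"^\d+-"):
    """Return DROP DATABASE statements for names matching `pattern`.

    `pattern` is a hypothetical naming convention for build databases.
    """
    rx = re.compile(pattern)
    return [
        f"DROP DATABASE IF EXISTS {quote_ident(name)};"
        for name in db_names
        if name not in PROTECTED and rx.search(name)
    ]


if __name__ == "__main__":
    # Example names are made up for illustration.
    names = ["postgres", "12345-10-0-abcdef-base", "production"]
    for stmt in drop_statements(names):
        print(stmt)
```

The list of names can be obtained from `SELECT datname FROM pg_database`.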

And since this was still failing in lots of cases, I went through a
2h pdb session to track down a missing fix, which I just applied on the
runbot and which seems to fix the builds on the v10 branch (this was
introduced by the merge of the upstream branch of odoo-extra in our

Things seem to be getting back to normal now.

Alexandre Fayolle
Chef de Projet
Tel : +33 4 58 48 20 30

Camptocamp France SAS
Savoie Technolac, BP 352
73377 Le Bourget du Lac Cedex
