Repeated crashes at random times

by Tomas Suarez -

Hello everyone,


For the last two months we’ve been facing problems with our Moodle installation, and our main issue is that we can’t diagnose them. The behaviour is as follows: at seemingly random times, one of the cores of the web server's host reaches 100% usage, attributed to a softirq for the network interface controller, and stays that way. That makes it unavailable to the rest of the system, resulting in a complete crash. After a couple of minutes, the usage goes back down and everything starts working again. We can speed up the recovery by cutting all the connections between the web server and the load balancer, which reduces the downtime to just a few seconds.
At and around the time of those CPU usage spikes, there is no other strange behaviour. Memory, disk and network usage are normal, and both the load balancer and the db server are working as always.
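
For anyone who wants to look at the same symptom on their own host, here is a minimal sketch of how the per-core softirq counters can be sampled. It is not the tool we used. It assumes a Linux host that exposes /proc/softirqs and that the relevant counter is NET_RX (an assumption: all we know is that the softirq belongs to the network interface). The function names are made up for illustration.

```python
import time
from typing import Dict, List


def parse_softirqs(text: str) -> Dict[str, List[int]]:
    """Parse the content of /proc/softirqs into {name: [count per CPU]}."""
    lines = text.strip().splitlines()
    counts: Dict[str, List[int]] = {}
    for line in lines[1:]:  # the first line is the CPU0 CPU1 ... header
        name, sep, values = line.partition(":")
        if sep:
            counts[name.strip()] = [int(v) for v in values.split()]
    return counts


def per_cpu_delta(before: Dict[str, List[int]],
                  after: Dict[str, List[int]],
                  name: str = "NET_RX") -> List[int]:
    """How many softirqs of the given type each CPU handled between two samples."""
    return [b - a for a, b in zip(before[name], after[name])]


def busiest_cpu(deltas: List[int]) -> int:
    """Index of the CPU with the largest increase."""
    return max(range(len(deltas)), key=deltas.__getitem__)


def sample(path: str = "/proc/softirqs", interval: float = 1.0,
           name: str = "NET_RX") -> List[int]:
    """Read the counters twice, `interval` seconds apart, and return the deltas."""
    with open(path) as f:
        before = parse_softirqs(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_softirqs(f.read())
    return per_cpu_delta(before, after, name)
```

Calling `sample()` during a spike and again during normal operation would show whether a single core is handling almost all of the network softirqs, or whether the load is spread across cores.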

At the very beginning we thought we were being attacked, so we took the following measures (some of them were already in place):


# Firewall: Only allows connections via TCP and ports 80 and 443.


# TCP:

        tcp-request connection track-sc1 src table per_ip_connections

        tcp-request connection reject if { sc_conn_cur(1) ge 15 } || { sc_conn_rate(1) ge 20 }


# Slowloris:

        timeout http-request 5s

        option http-buffer-request


# HTTP:

        http-request deny if HTTP_1.0

        http-request track-sc0 src table per_ip_rates

        http-request deny deny_status 429 if { sc_http_req_rate(0) gt 150 }


# Against dubious browsers:

        http-request deny unless { req.hdr(user-agent) -m found }

        http-request deny if { req.hdr(user-agent) -i -m sub curl phantomjs slimerjs }


backend per_ip_rates

        stick-table type ip size 1m expire 1m store http_req_rate(5s)

backend per_ip_connections

        stick-table type ip size 1m expire 1m store conn_cur,conn_rate(3s)
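
To make the thresholds in the rules above explicit, here is a small sketch that mirrors only the comparisons, outside the load balancer. The function names are hypothetical, and the counter values are taken as plain inputs: how the proxy actually computes its rates over the 3s and 5s periods is not modelled here.

```python
from typing import Optional

BLOCKED_AGENT_SUBSTRINGS = ("curl", "phantomjs", "slimerjs")


def connection_rejected(conn_cur: int, conn_rate_3s: int) -> bool:
    """Mirrors: reject if sc_conn_cur(1) ge 15 || sc_conn_rate(1) ge 20."""
    return conn_cur >= 15 or conn_rate_3s >= 20


def request_denied_429(http_req_rate_5s: int) -> bool:
    """Mirrors: deny with status 429 if sc_http_req_rate(0) gt 150."""
    return http_req_rate_5s > 150


def user_agent_denied(user_agent: Optional[str]) -> bool:
    """Mirrors the two user-agent rules: deny when the header is absent (None),
    or when it contains one of the blocked substrings, ignoring case."""
    if user_agent is None:
        return True
    lowered = user_agent.lower()
    return any(s in lowered for s in BLOCKED_AGENT_SUBSTRINGS)


# Average per-second equivalents of the configured limits.
MAX_REQ_PER_SECOND = 150 / 5   # 30 requests per second per source IP
MAX_CONN_PER_SECOND = 20 / 3   # about 6.7 new connections per second per source IP
```

Read this way, a single source address is cut off once it averages more than 30 requests per second over 5 seconds, or holds 15 connections open at once.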


# SYN flood protection:

net.ipv4.tcp_syncookies = 1

net.ipv4.conf.all.rp_filter = 1

net.ipv4.tcp_max_syn_backlog = 1024

net.ipv6.conf.all.disable_ipv6 = 1

net.ipv6.conf.default.disable_ipv6 = 1

net.ipv6.conf.lo.disable_ipv6 = 1
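
A quick way to confirm that settings like these are really active on the running kernel is to compare them with the values under /proc/sys. The following is a minimal sketch with hypothetical function names. It assumes the usual mapping from a dotted sysctl key to a path under /proc/sys, which does not hold for keys containing an interface name with a dot in it.

```python
from typing import Callable, Dict, Tuple


def parse_sysctl_conf(text: str) -> Dict[str, str]:
    """Parse 'key = value' or 'key=value' lines, skipping blanks and comments."""
    settings: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def sysctl_path(key: str) -> str:
    """Map a dotted sysctl key to its file under /proc/sys."""
    return "/proc/sys/" + key.replace(".", "/")


def read_live_value(key: str) -> str:
    """Read the current kernel value for a key (Linux only)."""
    with open(sysctl_path(key)) as f:
        return f.read().strip()


def mismatches(expected: Dict[str, str],
               read: Callable[[str], str] = read_live_value
               ) -> Dict[str, Tuple[str, str]]:
    """Return {key: (expected, live)} for every key whose live value differs."""
    result: Dict[str, Tuple[str, str]] = {}
    for key, want in expected.items():
        live = read(key)
        if live != want:
            result[key] = (want, live)
    return result
```

Passing the text of the settings above to `parse_sysctl_conf` and the result to `mismatches` on the host would list any value that was written to the configuration but never applied.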


As you may have guessed, none of it worked. That’s why we also tried:


- Adding resources to the VMs running both the Docker container and the database.

- Upgrading Moodle (including PHP and MariaDB upgrades).

- Moving the container to another virtual machine.

- Installing irqbalance.


At the moment we have no idea what the root cause may be or what else to try, as everything we have tried has failed. Any help would be greatly appreciated.

Here are some extra details of our Moodle implementation:


- Dockerized

- PHP 7.4

- MariaDB 10.3.29 (running natively in a separate VM within the same cluster)

- Debian 10.9-slim

- Web server: apache2

- Moodle 3.10.8


Infrastructure:

- 4 x DELL PowerEdge R530

- Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz

- OpenNebula orchestrator (KVM)

- Hybrid storage server connected via fibre 


Docker host VM:

- 8 CPU cores

- 16 GB RAM

- 30 GB drive for the OS

- 1 TB for Moodle

- Debian 10

- Docker 20.10.10


Database VM:

- 8 CPU cores (running at half speed)

- 8 GB RAM

- 30 GB drive

- Debian 10

- MariaDB 10.3.29



Thanks.

