Kernels with zram for 841?


#1

There are a lot of active devices (841s) that run into memory limits on networks with many nodes. Has anybody tried running a kernel with compressed memory?

ZRAM might be able to reduce the RAM pressure without requiring swap on flash. I wanted to give it a try, but I noticed that the Gluon kernels don't include this module, so I can't try it out of the box.

Before I embark on compiling my own images (and possibly bricking my 841), I'd like to know whether it is possible at all. I haven't found a reason why it shouldn't work, but perhaps I am missing something obvious? I tried searching the Gluon and Freifunk forums and found nothing.

(I do recognize that fitting in the zram feature will use up precious bytes of flash storage.)

Thanks for any feedback you may have.


#2

Hi,

Actually, I don't think that approach is bad at all.

But can the CPU handle it?

Maybe with L2TP as the tunnel?

Regards


#3

My understanding is that "when RAM gets tight, things are already so catastrophic that it would only be a matter of seconds or minutes before twice the amount would be full as well".


#4

Hi,

Oh? What exactly are you basing that on?

As I wrote, I think it comes down to the processing power, and possibly the increased disk accesses (compressing and decompressing the data) on the router.

Regards


#5

It's not just OOM, but also "high load": what happens when the switch and batman tables (everything is kept in memory in two or three copies, after all) swell up and still have to be kept up to date constantly?

What I mean is: the RAM is probably not particularly static, but is constantly being shovelled around.


#6

Yes, based on the observations so far, that seems to be the case. So a few kB of extra free RAM unfortunately probably won't help.


#7

I don't really know to what extent this is already done, but garbage collection sounds more sensible in this case. (If this already happens, please ignore what I wrote.)


#8

Hi!
Not sure whether you'd prefer this in English, but here we go.

I am not sure how this is actually implemented, but I guess it is essentially a compressing ramdisk attached as a block device for paging. That would speed up major page faults a lot, since we'd be using RAM access instead of flash access. But there is a big trade-off to consider: we probably need quite a lot of material before the compression even becomes effective, so the device has to have a significant minimum size, and that amount of memory might already keep us out of the danger zone (EDIT: … if we did not use it for the ramdisk! :slight_smile: )

I guess such a mechanism works well when you deal with large processes being started, stopped, or (more likely) temporarily deactivated because they were considered to be hogging memory. Then large amounts of memory can be reclaimed by compressing large memory segments, and memory thrashing may quickly come to a halt. But on these embedded devices we probably deal with a lot of internal fragmentation of kernel data structures, so there is little chance that memory tiles stay untouched long enough to become candidates for being paged out, assuming that type of (kernel) memory can be handled that way at all.

In addition, with an increasing number of active nodes we probably see a growing working set, i.e. the areas of memory that are in constant use. If the working set grows larger than the available memory, we run into thrashing, and then only a reduction of the working set can help at all.
You could save memory by reducing overhead, or by marking tasks as low priority so they could be deactivated and done later; but such tasks hardly exist on network devices.


#9

Once more, summarized very briefly:

  1. the ramdisk may first need a fair amount of space before the compression as such becomes effective at all
  2. we are presumably dealing primarily with internal fragmentation, so this is hard to treat effectively (in the doctor-vs-patient sense) with memory management; you would have to suspend large processes to gain anything here
  3. for a network device with many nodes, the problem is surely the memory that is needed constantly and on an ongoing basis. Memory compression cannot fix that.
    You would have to reduce the overhead or temporarily suspend things, and the latter is surely out because there is nothing that could be suspended.

#10

The discussion is very interesting. German is perfectly fine, but my German grammar is not perfect. Zram works by keeping a compressed block device entirely in memory, so it does not write to flash.

My local setup has an 841 and a CPE210. The 841 is connected to my home 13/1 DSL line, and the CPE210 is connected through mesh-on-lan (and mesh-wlan). The average number of clients is 1; the maximum number of clients I've seen is 8. The VPN tunnel is unencrypted, so in theory the 841's CPU is powerful enough for this DSL line. However, the 841 is seeing a very high load and is relatively unstable. It is not clear what is causing the high load: the CPU is still idling a bit and there is still ~6 MB of free memory.

As the discussion in the thread points out, using ZRAM is not likely to help, but I wanted to find out anyway. I have a little too much free time on my hands right now. (If anybody in the ffrn region needs a data scientist who compiles Linux distros for fun, contact me.) So I tried it out: I compiled the 2017.1.4 version of Gluon and then manually scp'd over the modules required for zram and the swapon binary. The zram packages are all part of the standard LEDE build, but opkg is not installed in the image that is on my 841.
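For reference, the setup steps looked roughly like this; module names, the sysfs path, and the sizes are what worked in my experiment on this kernel, so treat it as a sketch rather than a polished script:

```shell
# Sketch of the manual zram swap setup described above; assumes the zram
# (and zsmalloc) modules plus mkswap/swapon are present on the node.
# Must be run as root on the router.
insmod zsmalloc
insmod zram

# Size the device; 3 MB was the largest size that stayed stable for me
echo $((3 * 1024 * 1024)) > /sys/block/zram0/disksize

# Format it as swap and enable it
mkswap /dev/zram0
swapon /dev/zram0
```

After this, `free` should show a non-zero Swap: line.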

I first tried a 10 MB zram swap device, and the device started thrashing straight away. A 5 MB swap device did work, but the machine rebooted again after a couple of hours. A 3 MB device was then stable, but the machine was struggling more: the load is ~16 and the CPU is constantly being hammered. For reference, the output of `free` is copied here:

                 total     used     free   shared  buffers   cached
    Mem:         27792    24064     3728        0     1176     1712
    -/+ buffers/cache:    21176     6616
    Swap:         2928     2908       20

For reference, the memory load without zram looks like this:
                 total     used     free   shared  buffers   cached
    Mem:         27792    21776     6016      132     1276     1760
    -/+ buffers/cache:    18740     9052
    Swap:            0        0        0

So zram has not been an improvement; the CPU is completely hammered. Again, this was expected given the discussion in the thread.

It still puzzles me what is causing the high load on this machine (without zram enabled). The machine has spare CPU cycles and memory, and it does not seem to be spawning processes constantly. What is causing the high load? Is this already known, or can I investigate this further?
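In case others want to poke at the same question, here is what I plan to look at, using only what BusyBox provides (a sketch; the D-state scan assumes the process name in `/proc/PID/stat` contains no spaces):

```shell
# Narrowing down a load average that CPU idle time and free memory don't
# explain. Load counts runnable plus uninterruptible (D-state) tasks, so
# tasks blocked in drivers or I/O inflate it without burning CPU.

# 1/5/15-minute load plus runnable/total task counts
cat /proc/loadavg

# Processes in uninterruptible sleep (state D), scanned via /proc directly
# because BusyBox ps does not always print a state column
for pid in /proc/[0-9]*; do
    state=$(awk '{print $3}' "$pid/stat" 2>/dev/null)
    if [ "$state" = "D" ]; then
        echo "$pid: $(cat "$pid/comm")"
    fi
done

# Context-switch counter; if it grows very fast between two reads,
# scheduler churn rather than one hot process is the likely cause
grep '^ctxt' /proc/stat
```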

Thanks for all the input!

P.S. A completely different topic is why a leaf node like mine needs to keep 3 copies of a routing table with 1500+ entries. The routing table only needs ~11 entries (3 neighbours and 8 clients); all traffic with an unknown destination can simply be forwarded to the VPN gateway.
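To see how large those batman tables actually are on a node, batctl (shipped in Gluon images) can dump them; counting lines is only my rough approximation of table size:

```shell
# Rough sizes of the batman-adv tables on this node; requires batman-adv
# to be loaded, as on any running Gluon node
batctl o  | wc -l   # originators: one line per known node in the mesh
batctl tg | wc -l   # global translation table: client MACs anywhere in the mesh
batctl tl | wc -l   # local translation table: clients behind this node
```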


#11

Hi,

which version of Gluon are you running? It's a known bug that the LEDE-based Gluon versions show an extremely high load and/or crash frequently.

Gluon issue: https://github.com/freifunk-gluon/gluon/issues/1243

Regards,
Matthias


#12

I am on 2017.1.4. Thanks for the link. That thread suggests the following workaround:

    echo fq_memory_limit 200 > /sys/kernel/debug/ieee80211/phy0/aqm

I’ll see if that works.


#13

It didn’t for me ;). The problem is that the device has such limited capabilities that you can’t install the proper tools to track down this issue.


#14

The 841 has plenty of resources to debug this. It is just that the BAT tables take up too much memory. :wink: