Server(forge) keeps crashing

Driberif · July 31, 2020, 12:53pm

Hi,

I’m running MineOS on my FreeNAS system using the plugin. Lately I have upgraded the server from vanilla to Forge (running JEI, Mekanism+addons and Refined Storage). However, since this upgrade the server keeps crashing:
java.lang.Error: ServerHangWatchdog detected that a single server tick took 60.01 seconds (should be max 0.05)

Server seems fine until we start walking around (aka, try to leave the house). I have checked the system running the server: the CPU is barely being hit and there is still plenty of RAM available as well. I have already tried upgrading the Forge version (it was forge-1.16.1-32.0.70, now running forge-1.16.1-32.0.98), but that didn’t help.
Any suggestions for what could be causing this / how to prevent it? I’m not sure if this is a forge/minecraft issue or a MineOS issue.
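
For a sense of scale, the two numbers in the watchdog message can be put side by side. This is only a sketch of the arithmetic on the values from the error line; the ticks-per-second figure is derived from the 0.05 s budget and is not stated in the log itself.

```python
# Values copied from the ServerHangWatchdog error line above.
TICK_BUDGET_S = 0.05     # "should be max 0.05"
OBSERVED_TICK_S = 60.01  # "took 60.01 seconds"

# Derived figure (an inference from the budget, not something the log states).
ticks_per_second = round(1 / TICK_BUDGET_S)

# How many normal-length ticks would have fit into the one stalled tick.
ticks_overdue = round(OBSERVED_TICK_S / TICK_BUDGET_S)

print(f"Target rate: {ticks_per_second} ticks per second")
print(f"The stalled tick lasted as long as {ticks_overdue} normal ticks")
```

In other words, the server is not running a bit slow here; a single tick stalled for about a minute before the watchdog stepped in.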

Since I don’t have permissions to attach .txt files, please refer to pastebin for the crash log(s):

aklakina · July 31, 2020, 2:57pm

Hey there!

Look up “best JVM arguments for running minecraft server”.

By default Java uses only one core for all its computation and that’s why it can bug and crash out.

If this does not solve it then there should be a timeout option in the server.properties file and you should set it to 0. This turns off this watchdog “feature”.
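
A minimal sketch of checking that setting, assuming the timeout option meant here is `max-tick-time` (the key is not named in this thread, so treat it as an assumption). A 60000 ms value would match the 60 seconds in the crash message. The sample text and both helper functions are made up for illustration, and treating every non-positive value as "watchdog off" is also an assumption.

```python
# Hypothetical excerpt of a server.properties file (not taken from this thread).
SAMPLE = """\
# Minecraft server properties
max-tick-time=60000
"""


def read_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def watchdog_limit_seconds(props, key="max-tick-time"):
    """Return the watchdog limit in seconds, or None if it looks disabled.

    Assumption: a missing key means 60000 ms, and non-positive values
    switch the watchdog off.
    """
    millis = int(props.get(key, 60000))
    return millis / 1000 if millis > 0 else None


limit = watchdog_limit_seconds(read_properties(SAMPLE))
print("Watchdog limit:", "disabled" if limit is None else f"{limit} s")
```

Pointing `read_properties` at the contents of the real file shows what the server is currently using before anything gets changed.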

Driberif · July 31, 2020, 10:13pm

Hi,
I looked up some better JVM arguments, however the server once again crashed after ~10-15 minutes.
Edit: the exact arguments I ended up using (and which had 0 effect from what I could tell):
-XX:+UseG1GC -Dsun.rmi.dgc.server.gcInterval=2147483646 -XX:+UnlockExperimentalVMOptions -XX:G1NewSizePercent=20 -XX:G1ReservePercent=20 -XX:MaxGCPauseMillis=50 -XX:G1HeapRegionSize=32M
With -Xmx and -Xms both set to 2048MB using the web interface.

Now this is probably my fault for not mentioning it before, but there is some lag before the server dies as well. Considering that, turning off the watchdog would seem counterproductive and only lead to the server going down a lag spiral until manual reboot?

Regarding the single core usage: I can see in my FreeNAS interface that all 4 of the cores are being hit ~equally, averaging around 10%.

aklakina · August 1, 2020, 8:45am

So, I decided to post the arguments that I use for running an Enigmatica 2 Expert modpack for me and my friends.
-XX:+UseG1GC -XX:+UnlockExperimentalVMOptions -XX:MaxGCPauseMillis=100 -XX:+DisableExplicitGC -XX:TargetSurvivorRatio=90 -XX:G1NewSizePercent=50 -XX:G1MaxNewSizePercent=80 -XX:G1MixedGCLiveThresholdPercent=35 -XX:+AlwaysPreTouch -XX:+ParallelRefProcEnabled -Dusing.aikars.flags=mcflags.emc.gs -Dfile.encoding=UTF-8
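
With two flag lists now posted in this thread, it may help to see where they overlap and where they disagree. The sketch below only does string handling on the two lists as posted; `parse_flags` is a helper invented for this illustration, not part of MineOS or Java.

```python
# Flag lists copied from the two posts in this thread.
DRIBERIF = (
    "-XX:+UseG1GC -Dsun.rmi.dgc.server.gcInterval=2147483646 "
    "-XX:+UnlockExperimentalVMOptions -XX:G1NewSizePercent=20 "
    "-XX:G1ReservePercent=20 -XX:MaxGCPauseMillis=50 -XX:G1HeapRegionSize=32M"
)
AKLAKINA = (
    "-XX:+UseG1GC -XX:+UnlockExperimentalVMOptions -XX:MaxGCPauseMillis=100 "
    "-XX:+DisableExplicitGC -XX:TargetSurvivorRatio=90 -XX:G1NewSizePercent=50 "
    "-XX:G1MaxNewSizePercent=80 -XX:G1MixedGCLiveThresholdPercent=35 "
    "-XX:+AlwaysPreTouch -XX:+ParallelRefProcEnabled "
    "-Dusing.aikars.flags=mcflags.emc.gs -Dfile.encoding=UTF-8"
)


def parse_flags(line):
    """Map each flag name to its value (True/False for -XX:+/- switches)."""
    flags = {}
    for token in line.split():
        if token.startswith("-XX:"):
            body = token[4:]
            if body[0] in "+-":
                flags[body[1:]] = body[0] == "+"
                continue
        elif token.startswith("-D"):
            body = token[2:]
        else:
            body = token
        name, sep, value = body.partition("=")
        flags[name] = value if sep else True
    return flags


mine = parse_flags(DRIBERIF)
theirs = parse_flags(AKLAKINA)

# Flags present in both lists but set to different values.
differing = {
    name: (mine[name], theirs[name])
    for name in sorted(set(mine) & set(theirs))
    if mine[name] != theirs[name]
}

for name, (a, b) in differing.items():
    print(f"{name}: {a} vs {b}")
```

Only two of the shared flags carry different values, so most of the gap between the two setups is in flags that appear in one list and not the other.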

If you are worried about turning off the watchdog, then pregenerate a large area while the server is up but you are not playing on it. That way you take a huge amount of load off the server while adventuring and such, so it won’t crash that easily.

Hope this helps.

hexparrot · August 1, 2020, 2:29pm

Unpopular opinion, but I believe that all this crashing is likely caused by Java tweaks.

As is often the case, these really long java-tweak copy/pastes often do not bring practical performance increases, but definitely can put Java in a state of unpredictable usage and throughput.

-Dsun.rmi.dgc.server.gcInterval=2147483646

Are you familiar with the default distributed garbage collector for RMI?

Notable problems related to this tweak: “You observe a pattern of explicit garbage collection on a regular interval (usually every 60 seconds), but the application does not contain any System.gc() calls. The problem might be Remote Method Invocation (RMI) distributed garbage collection.”

-XX:MaxGCPauseMillis=50

The target for how long a single garbage collection pause may take before the collector hands time back to the application. It is a goal the collector aims for, not a hard limit.

If a garbage collection cannot get through its work within the 50 millisecond target, the collector has to do less per collection, making comparatively inefficient use of its time. Smaller collections mean the garbage collector needs to run again sooner than expected, because it didn’t clear as much in its 50 millisecond allotment. To put it in perspective, the default is 200ms.

Realistically, think of this like cleaning your kitchen: instead of cleaning a lot at once (say, before and after cooking), you do it every 5 minutes, for 30 seconds at a time. Switching from “operating mode” to “cleaning mode” costs so much that you needlessly increase overhead and context switching.

-XX:G1NewSizePercent=20

This sets the minimum size of the G1 collector’s young generation, as a percentage of the heap.

You’re making the young generation size 20%, which in itself isn’t bad (compared to the 5% default), but you made the young generation 4 times larger and gave the collector one quarter of the time it previously had to clean it.

Realistically, you’re telling the G1 collector to work 16 times more efficiently (if it were to complete GCs), which I don’t think it is pulling off.
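
The 16x figure is just the two ratios multiplied together. Here is a quick sketch of that arithmetic using only the values quoted in this post (5% and 200ms as defaults, 20% and 50ms as the tuned values). It assumes the work per collection grows linearly with young generation size, which is a simplification rather than a statement about how G1 actually behaves.

```python
# Values quoted in this post.
DEFAULT_NEW_SIZE_PERCENT = 5
TUNED_NEW_SIZE_PERCENT = 20
DEFAULT_PAUSE_MS = 200
TUNED_PAUSE_MS = 50

# How much more young generation there is to go through.
size_factor = TUNED_NEW_SIZE_PERCENT / DEFAULT_NEW_SIZE_PERCENT

# How much less time each collection is given.
time_factor = DEFAULT_PAUSE_MS / TUNED_PAUSE_MS

# Simplifying assumption: work scales linearly with young generation size.
required_speedup = size_factor * time_factor

print(f"{size_factor:.0f}x the space in 1/{time_factor:.0f} of the time "
      f"-> {required_speedup:.0f}x the throughput")
```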

-XX:G1ReservePercent=20

Purpose: “Sets the percentage of reserve memory to keep free so as to reduce the risk of to-space overflows. The default is 10 percent. When you increase or decrease the percentage, make sure to adjust the total Java heap by the same amount.”

That is, you’re setting the memory reserve (read as “memory that can’t be used by Minecraft”) to be twice as large. “To-space” is used during GCs to help determine proper object tenuring.
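
To put a number on that, here is a rough sketch using the 2048MB heap mentioned earlier in this thread. It simply applies the reserve percentage to the whole heap; how G1 accounts for the reserve internally is not covered, so read the result as an estimate.

```python
HEAP_MB = 2048                 # -Xmx/-Xms value given earlier in the thread
DEFAULT_RESERVE_PERCENT = 10   # default quoted in the description above
TUNED_RESERVE_PERCENT = 20     # value from the posted flags


def reserve_mb(heap_mb, percent):
    """Memory held back as reserve, assuming a plain percentage of the heap."""
    return heap_mb * percent / 100


default_reserve = reserve_mb(HEAP_MB, DEFAULT_RESERVE_PERCENT)
tuned_reserve = reserve_mb(HEAP_MB, TUNED_RESERVE_PERCENT)

print(f"Default reserve: {default_reserve:.1f} MB")
print(f"Tuned reserve:   {tuned_reserve:.1f} MB")
print(f"Extra memory held back: {tuned_reserve - default_reserve:.1f} MB")
```

On a 2048MB heap that is roughly another 200MB Minecraft cannot use, unless the heap is raised by the same amount as the quoted description recommends.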

Put these all together:

We’re making the young generation of the heap bigger.
We’re giving the collector less time to go through it.
We’re giving minecraft less space by reserving more of the heap for garbage collection overhead.
We’re using experimental options and having unexpected behavior.

Driberif · August 1, 2020, 2:59pm

Fair. I’ll freely admit I have (mostly) no clue what most of those tweaks did; I just took the advice aklakina gave me earlier in this thread and googled some tweaks to see if that would help. Although it’s hard to tell, it definitely didn’t make things any better.

When I made this post the only parameters I had in there were the defaults from MineOS, but those also led to crashes.

Any chance you have some suggestions of what exactly is causing the problems here, or maybe even a way to fix them?