Hi, this is my first post here and also my first attempt at creating a game at this scale, so any tips you might have that are not included in the docs will be greatly appreciated.
To give you some context, the game I am working on is mechanically a clone of an old Warcraft 3 map, Castle Fight, and it looks a bit like this:

Essentially, you have a base and you spawn more and more of different unit types, trying to overwhelm the enemy and destroy their base.
As you can imagine, that means having a whole lot of units on the map, and from some playtesting it seems the game becomes a lot more fun the more units there are.
The current state of my game handles up to about 220-240 units flawlessly at 30 physics fps. Above that the fps starts to drop very quickly, and at about 320 units it's running at ~3 fps. The profiler shows that the unit's _physics_process method takes up a huge chunk of the frame time, with physics time being the next largest. Here's what that looks like:

Both the _physics_process and _process functions are the unit's. What is very interesting to me here is that _process has been called 309 times, which is the correct number of units, but for some reason _physics_process has been called three times that. I am guessing this has something to do with the _physics_process calls of all units taking so long that they span multiple frames. Please correct me if I am wrong.
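To check that guess, here is a minimal sketch of a throwaway counter node. As far as I understand, Godot runs physics at a fixed timestep and may run several physics steps within a single rendered frame when frames take too long, which would explain the 3x call count. The node and the variable name physics_ticks_this_frame are hypothetical and not part of my project.

```gdscript
extends Node

# Hypothetical debug helper: counts how many physics steps run
# between two consecutive rendered (idle) frames.
var physics_ticks_this_frame = 0

func _physics_process(_delta):
    physics_ticks_this_frame += 1

func _process(_delta):
    # More than one tick here means the engine ran several physics
    # steps before drawing this frame.
    if physics_ticks_this_frame > 1:
        print("physics steps in this rendered frame: ", physics_ticks_this_frame)
    physics_ticks_this_frame = 0
```

If this prints values around 3 once the unit count passes 300, the extra _physics_process calls come from the engine catching up on physics steps rather than from extra units.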
Here's what a good frame looks like in the profiler at around 230 units

And to show you what the units actually look like and do, here's what the unit scene looks like

where visibleArea and attackArea are two areas that detect enemy units, but they are not used in _physics_process at all.
And here's what _physics_process actually does:
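func _physics_process(delta):
    # path_id == -1 means the unit currently has no path to follow.
    if path_id != -1:
        # Vector from the unit to the current waypoint.
        var difference = path[path_id] - translation
        move_and_slide(difference.normalized() * $entity.movement_speed, Vector3(0, 1, 0))
        # Close enough to the waypoint: advance to the next one.
        if difference.length() < 1:
            path_id += 1
            # Past the last waypoint: stop following the path.
            if path_id == path.size():
                path_id = -1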
Lastly, it seems that the slowdown is directly related to the areas and to the units getting stuck next to each other. If I disable the collision between the units, that gives a bit of a performance increase. If I disable the areas' collision shapes, that also gives a pretty huge performance increase.
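One thing I could try before anything drastic is narrowing what the areas are allowed to detect. This is only a sketch under assumptions: it assumes the unit root is a KinematicBody, that visibleArea and attackArea are its direct children, and that unit bodies sit on physics layer bit 0 (the UNIT_BODY_BIT constant is hypothetical). I have not measured whether it helps.

```gdscript
extends KinematicBody

# Assumption: all unit bodies are on physics layer bit 0.
const UNIT_BODY_BIT = 0

func _ready():
    for area in [$visibleArea, $attackArea]:
        # The areas only need to detect others; nothing has to detect them.
        area.monitorable = false
        # Keep the areas off every layer so they never pair with each other...
        area.collision_layer = 0
        # ...and let them scan only the layer the unit bodies are on.
        area.collision_mask = 0
        area.set_collision_mask_bit(UNIT_BODY_BIT, true)
```

The idea is that with several hundred units, each carrying two areas, area-to-area pairs would otherwise be checked even though only area-to-body overlaps matter.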
Unfortunately, I need them, so that's why I am here. I am curious to see if other people have faced the same issues and also how they solved them. Any suggestions for better design ideas will be appreciated.
I am looking to be able to get around 500-600 units maximum in the scene and keep the physics fps around 30.
I am confident I could rewrite a lot of my logic in C++ if that would give a big enough increase. Another suggestion I have seen is to use the physics server directly rather than physics nodes, but I wanted to get more opinions before diving into something like that, as it would be quite a time sink.