work carried out

2023 - jan - 23

in the DQN network, the sequential layers are:

inputs -> 16 (ReLU) -> 16 (ReLU) -> num_actions (linear output)
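A minimal sketch of that layer stack, assuming a Keras-style `Sequential` model (the log does not name the framework, so the imports and the function name here are placeholders):

```python
# Hypothetical reconstruction of the described DQN head:
# two hidden Dense(16, relu) layers and a linear output with one Q value per action.
from tensorflow.keras import layers, models

def build_q_network(num_inputs: int, num_actions: int) -> models.Sequential:
    return models.Sequential([
        layers.Input(shape=(num_inputs,)),
        layers.Dense(16, activation="relu"),             # hidden layer 1
        layers.Dense(16, activation="relu"),             # hidden layer 2
        layers.Dense(num_actions, activation="linear"),  # one Q value per action
    ])
```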

observation

the Q value mean was constant and stayed in a stable position

conclusion

as for the control state, I removed exploration by removing epsilon

because of this, in order to stabilize, the model chooses to do nothing at all. this is a good example of what happens when exploration is removed and the network alone is not enough to find a stable position.

this experiment was a failure

the next step is to reintroduce epsilon and see how the agent performs
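For reference, a minimal sketch of the epsilon-greedy selection that the next experiment would reintroduce; the `q_values` input and the numpy-based randomness are assumptions, not the project's actual code:

```python
import numpy as np

rng = np.random.default_rng()

def select_action(q_values: np.ndarray, epsilon: float) -> int:
    """Epsilon-greedy: random action with probability epsilon, else the greedy action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))  # explore
    return int(np.argmax(q_values))              # exploit
```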

[plots: Q value, episode reward]

2023 - jan - 24

tasks to do

  • [x] add a player that generates purely random values (no probability weighting) to simulate erratic behavior
  • [x] when feeding players into the environment, introduce them one by one
    • for example, introduce the medium player first and run the environment for 10000 steps; then add another player with features similar to the first one, so the environment has 2 players, and run it for another 10000 steps; continue in the same way until the other players have also been introduced.
  • [ ] add more state sets to the world so that the model can move more freely in the environment.

problems that occurred during the process

players could only be added when the environment was reset, and a new player was added every 100000 steps.
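A hypothetical sketch of that schedule (the environment API and the names below are placeholders, not the project's real interface):

```python
PLAYER_ADD_INTERVAL = 100_000  # a new player becomes eligible every 100000 steps

def active_players(ordered_players: list, global_step: int) -> list:
    """Players that should be present in the environment at this point of training."""
    n = min(len(ordered_players), 1 + global_step // PLAYER_ADD_INTERVAL)
    return ordered_players[:n]

# applied only when the environment resets, e.g.:
# env.reset(players=active_players(ordered_players, global_step))
```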

interesting things seen during the process

player feeding was done in 2 different ways:
  • the 3 average-type players first, then the 3 best-type players, and finally the low-type players
  • the 3 best players first, then the average players, and finally the low players

observation 1

when feeding average players first

[plots: feed players loss, Q values, reward]

when feeding best players first

[plots: feed players loss, Q values, reward]

Note

From the observations we can see that, regardless of the feeding order, accuracy generally starts to go down after the 3rd player is added.

  • maybe it is because the learning rate had already decayed and the model cannot handle a sudden change; resetting it to a higher value when a new player is added and gradually decaying it again may help the model adapt in a positive manner (see the sketch after this list).

  • another suggestion is to add more dense layers to the model so that it can learn more.

    • but a proper method of adding new layers needs to be identified, rather than trying things at random.
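A hedged sketch of the first suggestion, restarting the learning-rate decay whenever a new player is added; the optimizer choice and the decay constants are assumptions, not values from this experiment:

```python
import tensorflow as tf

INITIAL_LR = 1e-3    # assumed starting learning rate
DECAY_RATE = 0.96    # assumed decay factor
DECAY_STEPS = 10_000

optimizer = tf.keras.optimizers.Adam(learning_rate=INITIAL_LR)

def lr_since_last_player(global_step: int, last_add_step: int) -> float:
    """Exponential decay restarted from the step at which the newest player was added."""
    steps = global_step - last_add_step
    return INITIAL_LR * DECAY_RATE ** (steps / DECAY_STEPS)

# inside the training loop:
# optimizer.learning_rate.assign(lr_since_last_player(step, last_add_step))
```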

2023 - jan - 25

as for the reward not being achieved, there can be a few reasons; these are the things that should be focused on in order to reach a result.

reward not being achieved

If the reward in a reinforcement learning (RL) problem is not being achieved and is instead stabilizing at a negative value, it could be an indication of a poor or suboptimal policy being learned by the RL agent. There are several potential reasons for this:

  • Poor exploration: The agent may not be exploring the state space effectively and getting stuck in a suboptimal region. In this case, increasing the exploration rate or using more advanced exploration methods such as count-based exploration or intrinsic motivation may help.

  • Inadequate function approximation: If the agent is using a function approximator (such as a neural network) to represent the policy, it may not have enough capacity or be sufficiently expressive to capture the optimal policy. Increasing the complexity of the function approximator or using more data to train it may help.

  • Inadequate reward signal: The reward signal may not be providing enough information for the agent to learn the optimal policy. In this case, carefully designing the reward function or shaping the rewards to guide the agent towards the optimal policy may be helpful (see the shaping sketch after this list).

  • Stochasticity in the environment: If the environment is stochastic, the agent may be unable to learn from the rewards and get stuck in poor regions. Try to reduce the stochasticity by adding more deterministic elements to the environment.

  • Poor learning algorithm: The agent may be using a poor learning algorithm, or the hyperparameters of the algorithm may not be well-tuned. In this case, experimenting with different learning algorithms or fine-tuning the hyperparameters may help.
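As one concrete way to act on the reward-signal point, here is a generic sketch of potential-based reward shaping; the potential function `phi` is a placeholder and would have to be designed for this specific environment:

```python
def shaped_reward(reward: float, state, next_state, gamma: float, phi) -> float:
    """r' = r + gamma * phi(s') - phi(s); this form of shaping preserves the optimal policy."""
    return reward + gamma * phi(next_state) - phi(state)

# example with a hypothetical distance-to-goal potential:
# phi = lambda s: -distance_to_goal(s)
# r = shaped_reward(env_reward, s, s_next, gamma=0.99, phi=phi)
```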

It is essential to understand the problem and the environment, and to experiment with different approaches and techniques to find the one that works best for the specific problem at hand.