This line assumes that term_probs is a tensor of size (batch_size, 2), but it is actually a tensor of size (batch_size, 1), because it is the output of Linear(embedding_size, 1) followed by F.sigmoid.
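A minimal sketch to reproduce the shape, in case it helps. The sizes, the layer name term_layer, and the random input are made up for illustration and are not taken from the repository; torch.sigmoid stands in for F.sigmoid, which computes the same thing.

```python
import torch
import torch.nn as nn

# Hypothetical sizes, chosen only for this illustration.
batch_size, embedding_size = 4, 8

# Hypothetical stand-in for the termination head described above.
term_layer = nn.Linear(embedding_size, 1)

# Random input in place of the real embedding.
hidden = torch.randn(batch_size, embedding_size)

# Same pattern as in the comment: a linear layer with one output unit, then a sigmoid.
term_probs = torch.sigmoid(term_layer(hidden))

# The second dimension is 1, so there is no column at index 1.
print(tuple(term_probs.shape))
```

With a single output unit, any code that reads a second column of term_probs will not find one.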
In the prosocial case, the reward printed should be rewards[0][2] and not the one listed here.
In fact, what does taking the mean of all 3 rewards mean? For the selfish case, I believe there should be 2 rewards: one for each agent at the end of the episode.
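To make the concern concrete, here is a small sketch with made-up numbers. It assumes, hypothetically, that each row of rewards is laid out as [agent 0 reward, agent 1 reward, joint reward]; that layout is my guess from the indexing and may not match the actual code.

```python
# Made-up values; the layout [agent 0, agent 1, joint] is an assumption.
rewards = [[1.0, 2.0, 3.0]]
row = rewards[0]

# Averaging all three entries mixes per-agent and joint rewards.
mean_all_three = sum(row) / len(row)

# Selfish case: only the two per-agent rewards would be relevant.
selfish_rewards = row[:2]
selfish_mean = sum(selfish_rewards) / len(selfish_rewards)

# Prosocial case: the single joint reward at index 2.
prosocial_reward = row[2]

print(mean_all_three, selfish_mean, prosocial_reward)
```

Under that assumed layout, the mean of all three entries matches neither the selfish nor the prosocial quantity, which is why the printed value is hard to interpret.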