Artificial intelligence strives to enable machines to replicate human-like perception and interaction. This research focuses on Multimodal Emotion Recognition in Conversation (MERC), a pervasive and practical task with applications in healthcare, intelligent driving systems, and autonomous robotics. The task remains challenging because it requires context awareness and is influenced by hidden variables such as speaker and listener interactions. This literature review examines four key challenges in emotion recognition in conversation (ERC): the subjective nature of emotion, conversational context, speaker-listener dynamics, and emotion shifts during conversations. We also summarize the mainstream algorithm architectures and classify them into five categories: multimodal ERC, attention mechanisms, transformer-based models, commonsense knowledge integration, and the deployment of large language models (LLMs).