Object tracking has largely relied on tools such as cameras, wearable devices, and smartphones. However, these methods often suffer from privacy concerns, high cost, and degraded performance under varying environmental conditions. To address these challenges, radio-frequency (RF) based sensing has emerged as a promising alternative. Our research adopts this approach by integrating RF signals with a multi-modal learning algorithm for object activity tracking. In this paper, we employ a web camera, a TP-Link AC1750 WiFi router, and an Intel 5300 network interface card (NIC); the router provides three transmit antennas and the NIC three receive antennas. The multi-modal model presented in this paper features a teacher-student architecture: the teacher network transforms the video stream captured by the camera into motion keypoints, which then supervise the student network. The student network, a fully convolutional network (FCN), is trained on the mapping between keypoints and RF signals to predict object movement. Our experimental results demonstrate that the designed multi-modal approach accurately detects and tracks object movement and motion.
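The teacher-student idea described above can be sketched as follows. This is a minimal illustrative example, not the paper's implementation: the vision teacher and the CSI capture are simulated with random data, and a single linear layer stands in for the fully convolutional student so the loop stays self-contained. All dimensions (subcarrier count, keypoint count) and function names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumed, not from the paper):
N_SUBCARRIERS = 30          # CSI subcarriers per antenna pair (Intel 5300 tool)
N_ANTENNA_PAIRS = 9         # 3 transmit x 3 receive antennas
N_KEYPOINTS = 14            # body keypoints emitted by the vision teacher
CSI_DIM = N_SUBCARRIERS * N_ANTENNA_PAIRS

def teacher_keypoints(n_frames):
    """Stand-in for the vision teacher: (x, y) keypoints per video frame."""
    return rng.normal(size=(n_frames, N_KEYPOINTS * 2))

def capture_csi(n_frames):
    """Stand-in for time-synchronized CSI amplitude measurements."""
    return rng.normal(size=(n_frames, CSI_DIM))

# Toy linear "student" trained with an MSE loss against the teacher's
# keypoints; the real student would be a fully convolutional network.
csi = capture_csi(256)
targets = teacher_keypoints(256)
W = np.zeros((CSI_DIM, N_KEYPOINTS * 2))

lr = 1e-3
for _ in range(200):
    preds = csi @ W                                # student forward pass
    grad = csi.T @ (preds - targets) / len(csi)    # gradient of MSE loss
    W -= lr * grad                                 # gradient-descent step

final_loss = float(np.mean((csi @ W - targets) ** 2))
print(f"distillation MSE after training: {final_loss:.4f}")
```

At inference time only the student runs, so movement can be predicted from RF signals alone, without the camera; this is the motivation for the cross-modal supervision shown in the loop.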