I have done things like this in the past. With OpenCV, I would probably start with a Haar cascade face detector, then a tracking loop to filter the angle to the center of the face, then construct a region around the face and pull it out using cv::remap to generate the virtual camera.
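A minimal sketch of that detect → filter → remap loop in Python/OpenCV, to show the shape of it. The crop size, smoothing factor, and output resolution are illustrative assumptions, not tuned values, and this does no fisheye dewarping, just a resampled crop:

```python
import cv2
import numpy as np

OUT_W, OUT_H = 1280, 720   # virtual camera output size (assumed)
ALPHA = 0.1                # low-pass factor for the face-center filter (assumed)

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

cap = cv2.VideoCapture(0)
center = None   # filtered face center (x, y)
crop_w = None   # filtered crop width

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

    if len(faces) > 0:
        # Track the largest detection and low-pass filter its center and size
        x, y, w, h = max(faces, key=lambda f: f[2] * f[3])
        target = np.array([x + w / 2.0, y + h / 2.0])
        target_w = w * 3.0  # frame roughly 3x the face width (assumed)
        if center is None:
            center, crop_w = target, target_w
        else:
            center = (1 - ALPHA) * center + ALPHA * target
            crop_w = (1 - ALPHA) * crop_w + ALPHA * target_w

    if center is not None:
        # Build remap grids that resample the crop region at OUT_W x OUT_H
        crop_h = crop_w * OUT_H / OUT_W
        xs = np.linspace(center[0] - crop_w / 2, center[0] + crop_w / 2, OUT_W)
        ys = np.linspace(center[1] - crop_h / 2, center[1] + crop_h / 2, OUT_H)
        map_x, map_y = np.meshgrid(xs.astype(np.float32), ys.astype(np.float32))
        virtual = cv2.remap(frame, map_x, map_y, cv2.INTER_LINEAR)
    else:
        virtual = cv2.resize(frame, (OUT_W, OUT_H))

    cv2.imshow("virtual camera", virtual)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
```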

One challenge you might face is that the remap step is effectively digital zoom, and 4K@180° may not leave many pixels per face, depending on how far away people sit. Some systems stitch several image sensors together to get good resolution on each face, and some use a conical mirror to capture a cylindrical FOV.
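A rough back-of-envelope, assuming a uniform angular resolution across the FOV and a person about 3 m away (both assumptions, not measurements):

```python
import math

sensor_px  = 3840   # horizontal pixels of a 4K sensor
fov_deg    = 180.0  # horizontal field of view
face_width = 0.15   # meters, typical face width (assumed)
distance   = 3.0    # meters from the camera (assumed)

px_per_deg = sensor_px / fov_deg                                        # ~21 px/degree
face_deg   = math.degrees(2 * math.atan(face_width / (2 * distance)))   # ~2.9 degrees
print(f"{px_per_deg * face_deg:.0f} px across the face")                # roughly 60 px
```

Sixty-ish pixels across a face is enough to detect it, but it makes for a fairly soft virtual close-up.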