Ok let's break it down again then:
The human eye does not see frames. It sees an endless stream of light hitting its sensor array, and that sensor array has its limits. Our sensors need a certain accumulation of light before an object is clearly visible, so when objects move, the light coming from them gets distributed across our view, and as they get faster their form becomes less and less pronounced, i.e. speed blurs objects up to the point where we can't spot their presence at all.
But despite all that, we never have a cut-off moment where we stop getting new information.
Movies are captured in a similar way. This time the light is captured with a digital sensor array or film, which also has limits as far as light/data accumulation goes. Objects that move fast will again leave a fainter trace of their form across the scene. The difference is that movies do need a cut-off point: they need to store their data picture by picture for the tech to work, and that's where we get the frames.
So 24 FPS was worked out to look smooth enough, which makes each picture stay statically on display for about 42ms. That sounds like a small amount of time, but compared to our normal vision, which has 0ms of static pictures, it is quite the gap.
Luckily, the shortfalls of sensors/film are exactly what makes movies so compatible. That 42ms picture might be static, but in the time it took to make it, up to nearly 42ms of incoming light was recorded, so the movement in that time frame actually was captured and we lost very little information.
Games, however, work from the complete opposite end. Renderers do not capture an ongoing scene, they create the scene from scratch one moment at a time. Every frame they create comes from an absolute standstill of the scene, so if you render at 24 FPS, that 42ms static image is not an accumulation of anything between frames. You have completely lost whatever went on in the past 42ms... which again is a minute time frame, but to our vision it is a gigantic information gap.
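To make the camera vs. renderer difference concrete, here is a rough sketch. It assumes an object moving at a constant, made-up speed and a shutter that stays open for the whole frame, as described above. The function names and the speed value are hypothetical, purely for illustration.

```python
def camera_streak(speed_px_per_ms, frame_start_ms, exposure_ms):
    """Positions the object sweeps through while the shutter is open.

    Returns (first_px, last_px): the blur streak recorded in one frame.
    """
    first_px = speed_px_per_ms * frame_start_ms
    last_px = speed_px_per_ms * (frame_start_ms + exposure_ms)
    return first_px, last_px


def rendered_point(speed_px_per_ms, frame_start_ms):
    """Single position a renderer draws: the scene frozen at one instant."""
    return speed_px_per_ms * frame_start_ms


FRAME_MS = 1000.0 / 24   # about 42 ms per frame at 24 FPS
SPEED = 2.0              # hypothetical object speed, pixels per millisecond

for n in range(3):
    t = n * FRAME_MS
    first, last = camera_streak(SPEED, t, exposure_ms=FRAME_MS)
    point = rendered_point(SPEED, t)
    print(f"frame {n}: camera covers {first:.0f}-{last:.0f} px, "
          f"render shows only {point:.0f} px")
```

In this toy model the camera's streaks line up end to end, so every position the object passed through ends up in some frame. The rendered points are isolated dots, and everything between them is simply never drawn.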
So we go up to 30 FPS, which makes a 33.3ms information gap. At 60 FPS we get a 16.7ms gap, at 120 FPS 8.3ms, at 240 FPS 4.2ms, and so on and so forth.
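The gap figures are just 1000ms divided by the frame rate. A quick sketch to check them (the function name is made up for this example):

```python
def frame_interval_ms(fps: float) -> float:
    """How long one frame stays on screen, in milliseconds."""
    return 1000.0 / fps


for fps in (24, 30, 60, 120, 240):
    print(f"{fps:>3} FPS -> {frame_interval_ms(fps):.1f} ms per frame")
```

Doubling the frame rate halves the gap every time, which is why each step up buys less in absolute milliseconds than the one before it.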
But wait, my monitor doesn't swap images that fast, so why would we even go there? Because many game engines have their mechanics locked to visual frames, so the information gap isn't just in the visual part, it also affects input.
At 30 FPS there are 33.3ms gaps where the game has no clue what you are telling it. Then the game jumps to the next moment and you need to compensate for whatever it missed or got wrong. In slow games with slow control schemes that gap is mostly covered by the game's inherent delayed response, but as the need for precision and speed goes up, that gap becomes an unavoidable hindrance to the entire experience.
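A small sketch of that input gap, assuming a simplified engine that only reads input once, at the start of each frame. Real engines differ (some poll input separately from rendering), so treat this as an illustration of the worst case, not a model of any particular game. The function name is hypothetical.

```python
import math


def wait_until_next_frame_ms(event_time_ms: float, fps: float) -> float:
    """Time an input waits before the next frame start can notice it.

    Assumes input is sampled only at frame boundaries (0, 1/fps, 2/fps, ...).
    """
    interval = 1000.0 / fps
    next_frame_ms = math.ceil(event_time_ms / interval) * interval
    return next_frame_ms - event_time_ms


# A button press that lands 1 ms after a frame has just started.
for fps in (30, 60, 120):
    wait = wait_until_next_frame_ms(1.0, fps)
    print(f"{fps:>3} FPS: input waits {wait:.1f} ms to be seen")
```

In this model the wait can get close to one full frame interval, so halving the interval halves the worst case.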
Does 60 FPS fix all the problems then... no, it just makes them half as bad as before, and 120 would make them one quarter as bad.