Yeah, using image recognition on a screenshot of the desktop and directing a mouse around the screen with coordinates is definitely an intermediate implementation. Open Interpreter, Shell-GPT, LLM-Shell, and DemandGen make a little more sense to me for anything that can currently be done from a CLI, but I’ve never actually tested em.
There used to be very real hardware reasons that upload had much lower bandwidth. I have no idea if there still are.