Problem: represent an image in a text format that LLMs (e.g. ChatGPT) can understand.
Solution:
Step 1: Detect the elements in the image and export them to JSON using the pre-trained UIED model.
Goal: To determine the location of the elements on the screen.
Example of usage:
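A minimal sketch of how the detected elements could be turned into text for an LLM. The JSON layout below (the "img_shape" and "compos" keys, the "column_min"/"row_min"/"column_max"/"row_max" fields and the optional "text_content" field) is an assumption made for illustration and may not match the actual UIED output; describe_elements is a hypothetical helper, not part of UIED.

```python
import json

# Hypothetical detection result. The field names are assumptions for
# illustration only and may differ from what UIED actually writes.
SAMPLE_JSON = """
{
  "img_shape": [800, 1280],
  "compos": [
    {"id": 1, "class": "Compo", "column_min": 40, "row_min": 100,
     "column_max": 400, "row_max": 140},
    {"id": 0, "class": "Text", "column_min": 40, "row_min": 20,
     "column_max": 300, "row_max": 60, "text_content": "Sign in"}
  ]
}
"""


def describe_elements(detection: dict) -> str:
    """Turn a detection dict into one line of text per element."""
    height, width = detection["img_shape"][:2]
    lines = [f"Screen size: {width}x{height} pixels"]

    # Order elements top-to-bottom, then left-to-right, so the text
    # roughly follows the reading order of the screen.
    compos = sorted(
        detection["compos"],
        key=lambda c: (c["row_min"], c["column_min"]),
    )
    for c in compos:
        w = c["column_max"] - c["column_min"]
        h = c["row_max"] - c["row_min"]
        label = c.get("text_content", "")
        suffix = f' text="{label}"' if label else ""
        lines.append(
            f'[{c["id"]}] {c["class"]} at x={c["column_min"]}, '
            f'y={c["row_min"]}, size={w}x{h}{suffix}'
        )
    return "\n".join(lines)


layout_text = describe_elements(json.loads(SAMPLE_JSON))
print(layout_text)
```

The resulting text can be pasted into a prompt as-is. Pixel coordinates are kept so the model can reason about relative position (above, below, next to).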
Step 2: Describe the page using a visual language model.
Goal: To determine the logic of the web page.
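A rough sketch of how this step might be wired up, assuming the element locations are already available as plain text. No specific visual language model is named in these notes, so the model is passed in as a callable; build_page_logic_prompt, describe_page_logic and fake_vlm are hypothetical names used only for illustration.

```python
from typing import Callable


def build_page_logic_prompt(layout_text: str) -> str:
    """Combine the detected-element text with a question about page logic."""
    return (
        "You are given a screenshot of a web page and a list of its "
        "detected UI elements.\n"
        "Detected elements:\n"
        f"{layout_text}\n"
        "Describe the purpose of the page and what each element does."
    )


def describe_page_logic(
    image_path: str,
    layout_text: str,
    vlm: Callable[[str, str], str],
) -> str:
    """Ask a visual language model about the page.

    `vlm` is any function taking (image_path, prompt) and returning text;
    plug in whichever model client is actually used.
    """
    prompt = build_page_logic_prompt(layout_text)
    return vlm(image_path, prompt)


def fake_vlm(image_path: str, prompt: str) -> str:
    """Stand-in for a real model so the sketch runs without dependencies."""
    return f"(model answer for {image_path}, prompt of {len(prompt)} characters)"


answer = describe_page_logic(
    "screenshot.png",
    "[0] Text at x=40, y=20, size=260x40",
    fake_vlm,
)
print(answer)
```

Keeping the model behind a plain callable means the prompt construction can be tested without network access, and the model can be swapped later.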