Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding