- Inspect the screenshots to identify the visual features unique to each event of interest
- Use [labelImg](https://github.com/heartexlabs/labelImg) to annotate those features in the images
- Train a YOLO model on the annotated images
- Use the trained YOLO model to detect objects, and OCR to extract text and its bounding boxes
- Apply conditional rules to identify events and split each screenshot into media units
- Compare the text of media units and use the similarity to decide whether two units are the same
- Classify a media unit as news if its text contains a keyword from the whitelist of news organizations
- Output the results as a CSV file
- Rules
Events or split | Rules |
---|---|
Splitting posts | The upper boundary of "three_dots" (located at the upper right of every post) |
Opening an external link | Any of the 6 labeled kinds of external-link header |
Viewing comments | The ">" and "Like" ("讚") at the top, and the comment box ("留言框") at the bottom |
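
The post-detection steps (splitting, similarity check, news check, CSV output) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's actual code: the tuple formats for detections and OCR lines, the function names, the `difflib`-based similarity measure, and the 0.8 threshold are all hypothetical. Only the `three_dots` label and the "split at its upper boundary" rule come from the table above.

```python
import csv
import io
from difflib import SequenceMatcher

# Assumed input formats (hypothetical, for illustration only):
#   detections: list of (label, y_top, y_bottom) from the object detector
#   ocr_lines:  list of (text, y_top) from the OCR step


def split_posts(detections, ocr_lines, height):
    """Split a screenshot into media units at the top edge of each "three_dots" box.

    Text above the first "three_dots" box is ignored, since its post
    started outside the screenshot.
    """
    cuts = sorted(y_top for label, y_top, _ in detections if label == "three_dots")
    bounds = cuts + [height]
    units = []
    for top, bottom in zip(bounds, bounds[1:]):
        text = " ".join(t for t, y in ocr_lines if top <= y < bottom)
        units.append({"top": top, "bottom": bottom, "text": text})
    return units


def same_unit(text_a, text_b, threshold=0.8):
    """Treat two units as the same if their texts are similar enough.

    SequenceMatcher and the 0.8 threshold are assumptions, not the
    similarity measure used by the original pipeline.
    """
    return SequenceMatcher(None, text_a, text_b).ratio() >= threshold


def is_news(text, whitelist):
    """Return True if the text contains any whitelisted news organization name."""
    lowered = text.lower()
    return any(name.lower() in lowered for name in whitelist)


def to_csv(units, whitelist):
    """Serialize the media units to CSV text, one row per unit."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["top", "bottom", "is_news", "text"])
    for u in units:
        writer.writerow([u["top"], u["bottom"], is_news(u["text"], whitelist), u["text"]])
    return buf.getvalue()


if __name__ == "__main__":
    # Made-up sample data: two posts in a screenshot 1000 pixels tall.
    detections = [("three_dots", 100, 130), ("three_dots", 600, 630)]
    ocr_lines = [("BBC News", 110), ("Election results announced", 200), ("My cat photo", 620)]
    units = split_posts(detections, ocr_lines, height=1000)
    print(to_csv(units, whitelist=["BBC News"]))
```

A plain substring match keeps the whitelist check easy to audit. If OCR noise turns out to be a problem, the same `SequenceMatcher` ratio could be applied to organization names as well, at the cost of more false positives.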