
paulpierre / markdown-crawler


A multithreaded 🕸️ web crawler that recursively crawls a website and creates a 🔽 markdown file for each page, designed for LLM RAG

Home Page: https://pypi.org/project/markdown-crawler/

License: MIT License

Python 100.00%
html-to-markdown html-to-markdown-converter html2md llm llmops markdown markdown-parser rag web-scraper markdown-crawler

markdown-crawler's Introduction

πŸ—οΈ building evm & llm πŸ΄β€β˜ οΈ warez

β–ˆβ–€β–€β€ƒβ–ˆβ–€β–ˆβ€ƒβ–ˆβ–„β–‘β–ˆβ€ƒβ–ˆβ–€β–„β€ƒβ–ˆβ–‘β–ˆβ€ƒβ–ˆβ–€β–€β€ƒβ–€β–ˆβ–€β€ƒβ–ˆβ€ƒβ–ˆβ–‘β–ˆβ€ƒβ–ˆβ–€β–€
β–ˆβ–„β–„β€ƒβ–ˆβ–„β–ˆβ€ƒβ–ˆβ–‘β–€β–ˆβ€ƒβ–ˆβ–„β–€β€ƒβ–ˆβ–„β–ˆβ€ƒβ–ˆβ–„β–„β€ƒβ–‘β–ˆβ–‘β€ƒβ–ˆβ€ƒβ–€β–„β–€β€ƒβ–ˆβ–ˆβ–„
https://conductive.ai  // @conductiveai

⭐️ repos πŸ’¬ RasaGPT - Headless chat platform built on top of Rasa and Langchain 2k+ ⭐️'s #3 trending on HN
πŸ™ Hydralisk - scale and fund millions of EVM-chain wallets via CLI
πŸ•΅οΈ Informer - Telegram Mass Surveillance w/ 1.3k+ ⭐️'s and 170+ β‘‚ featured on the frontpage of HN
🐦 twig.py - a twitter web3 influencer truffle pig used for finding engaged users
🧠 tech evm, solidity, erigon, geth, gpt, llms, agents, langchain, llamaindex, stable diffusion, lora
πŸ’¬ lang python, golang, js, kotlin, swift, vue, nodejs
πŸ’½ data clickhouse, postgres, timescaledb, ksqldb
πŸš‡ pipelines kafka, redpanda, celery, rabbitmq, redis, debezium, airflow
πŸ§‘β€πŸ³ orchest k8s, hashicorp, docker, nomad l00n1x
πŸ‘·β€β™‚οΈ work build0r of web3 user acquistion tech and conversational AI
🌱 xp adtech, data engineering, mobile games, machine learning
❀️ ❀️ 🐍🎷, graph theory, sci-fi, games, πŸ€– openly a balaji's network state npc
πŸŽ™οΈ podcasts all-in-podcast, lex fridman, yc, network state, a16z, MoZ, MFM, proof, jre, flagrant2

gm

markdown-crawler's People

Contributors

paulpierre


markdown-crawler's Issues

Not all links are found when custom `target_content` is set.

I wanted to try out this package for an internal RAG presentation based on some of our company website data. Our website has the typical structure: header, main content, and footer. Most of the links live in the header and footer, but if I set `target_content` to only the main container, no links are found.

Is there a way to collect all URLs from outside the container I ultimately want to parse into markdown? When I don't specify the target, I get quite a mess of navigation and footer link data that repeats on every page (and would have to be cleaned up for every single page).

Thanks in advance!


Edit: I think it was actually the combination of the domain flag, the base path flag, and the target content that returned zero results. Once I removed the extra flags, `target_content` worked as advertised.
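The behavior the poster asks for — collect links from the whole page while converting only one container to markdown — can be sketched with the standard library. This is an illustrative sketch, not markdown-crawler's actual implementation; the `target_id` parameter and class name are made up, and it assumes balanced HTML tags:

```python
# Sketch: gather every <a href> on the page (header/footer included)
# while separately tracking text inside one target container.
from html.parser import HTMLParser


class LinkAndTargetParser(HTMLParser):
    """Collects all links page-wide, but only text inside the element
    whose id matches `target_id` (assumes balanced markup)."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.links = []        # every link on the page, for the crawl frontier
        self.in_target = 0     # nesting depth inside the target container
        self.target_text = []  # text from the target only, for markdown output

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'a' and 'href' in attrs:
            self.links.append(attrs['href'])  # collect links everywhere
        if attrs.get('id') == self.target_id or self.in_target:
            self.in_target += 1

    def handle_endtag(self, tag):
        if self.in_target:
            self.in_target -= 1

    def handle_data(self, data):
        if self.in_target and data.strip():
            self.target_text.append(data.strip())


html = """
<header><a href="/about">About</a></header>
<main id="content"><p>Hello world</p><a href="/docs">Docs</a></main>
<footer><a href="/legal">Legal</a></footer>
"""
parser = LinkAndTargetParser('content')
parser.feed(html)
print(parser.links)        # all three links, including header/footer
print(parser.target_text)  # only the main-content text
```

Decoupling the link frontier from the content selector like this is what lets a crawler follow navigation links without polluting the markdown output with them.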

"Continue where you left off" feature is not working robustly

I believe there is a problem when crawling massive documentation portals: if the crawler hits a condition it cannot handle and exits unexpectedly, the continuation task restarts from the beginning rather than resuming intelligently, so all the previously crawled pages are re-checked before any new pages are downloaded. This is my constant experience with Stripe's documentation portal, where after around an hour the crawler fails to move forward and there is no option to resume exactly where it left off.

My suggestion is to change the default behavior to continue where you left off without re-checking previous pages. I assume the current implementation also detects whether already-crawled pages have changed and applies those changes to the existing files, instead of simply sticking to the current progress and finishing the crawl job, which takes a massive amount of time.
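The resume behavior the issue suggests can be sketched in a few lines: before fetching, skip any URL whose markdown file already exists on disk. The `slugify` naming scheme below is an assumption for illustration, not markdown-crawler's actual file-naming logic:

```python
# Sketch: resume a crawl by skipping URLs already written to disk.
import os
import re
import tempfile


def slugify(url):
    """Turn a URL into a flat, filesystem-safe markdown filename.
    (Hypothetical scheme, not markdown-crawler's real one.)"""
    return re.sub(r'[^a-zA-Z0-9]+', '-', url).strip('-') + '.md'


def urls_to_fetch(urls, base_path):
    """Return only the URLs whose markdown output does not exist yet."""
    return [u for u in urls
            if not os.path.exists(os.path.join(base_path, slugify(u)))]


# Usage: pretend one page was written out before the crawler died.
base = tempfile.mkdtemp()
open(os.path.join(base, slugify('https://docs.stripe.com/payments')), 'w').close()

pending = urls_to_fetch(
    ['https://docs.stripe.com/payments', 'https://docs.stripe.com/billing'],
    base,
)
print(pending)  # only the /billing page still needs fetching
```

The trade-off the issue describes is real: skipping existing files makes resumption instant, but it will miss pages that changed upstream since the previous run, so it is best offered as an opt-in flag rather than silently replacing a freshness check.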
