It's 42: HTML Content Extraction

Lately I have been writing an HTML content extraction engine for the VanillaBrain.com project.

I can't say what the VanillaBrain project is yet because it is still at an early stage, but I can share some useful or inspirational links about HTML content extraction.

Related products:

Apache Tika (content analysis toolkit): http://tika.apache.org/

Apache Nutch (Web Crawler SDK): http://chaehyun.kr/wiki/images/b/bc/Nutch_intro_20110806_open.pdf

Apache Mahout (Machine Learning): http://mahout.apache.org/

Readability: http://lab.arc90.com/2009/03/02/readability/

Readability.com source (old): https://code.google.com/p/arc90labs-readability/source/browse/trunk/js/readability.js

Diffbot.com: https://www.diffbot.com/

Provalis Wordstat: http://provalisresearch.com/products/content-analysis-software/

KNIME: http://www.knime.org/knime

KNIME Guide: http://www.knime.org/files/003_-_knime_boston2013_koetter.pdf

HTML Structure Visualization: http://stackoverflow.com/questions/14911274/tools-to-visualize-an-html-document-tree-dom-tree

None of the products above supports JavaScript, and I have not been able to find one that does.

I'm trying out some headless web browsers to crawl JavaScript-dependent pages, but I haven't found a satisfactory one yet.

Related articles:

https://code.google.com/p/boilerpipe/

http://wwwis.win.tue.nl/bnaic2009/papers/bnaic2009_paper_113.pdf

http://eprints.weblyzard.com/55/1/lang2012-textSweeper.pdf

http://luyin.tistory.com/349

http://cafe.daum.net/_c21_/bbs_search_read?grpid=1CBe5&fldid=JxlJ&datanum=58

http://jakarta.tistory.com/76

https://trello.com/c/Xdy1qU1o/5--

https://github.com/yasserg/crawler4j

http://helloworld.naver.com/helloworld/112673

http://blog.leeyh92.com/?p=331

Inspired by the articles above, I am now writing a graph-based algorithm, but it is not working properly yet.
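The graph-based algorithm mentioned above is not described in this post, so the following is not it. For readers new to the topic, here is a minimal sketch of a different and much simpler idea, a link-density heuristic in the spirit of tools like Readability: credit each piece of text to its nearest enclosing block element, then pick the block with the most plain text and the least link text. It uses only Python's standard `html.parser`. The names `Block`, `BlockCollector`, `extract_main_text` and the `BLOCK_TAGS` set are made up for this illustration, and the scoring rule is an assumption rather than a tuned formula.

```python
from html.parser import HTMLParser

# Text inside these tags is never body content.
SKIP_TAGS = {"script", "style", "noscript"}
# Void elements have no closing tag, so they are never pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
# Hypothetical choice of elements treated as candidate content containers.
BLOCK_TAGS = {"body", "div", "article", "section", "main", "td"}


class Block:
    """Text statistics for one candidate container."""

    def __init__(self, tag):
        self.tag = tag
        self.chunks = []      # all text pieces, in document order
        self.text_chars = 0   # characters of text outside <a>
        self.link_chars = 0   # characters of text inside <a>

    def score(self):
        # Assumed scoring rule: reward plain text, penalize link text.
        return self.text_chars - self.link_chars


class BlockCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []    # open elements as (tag, Block or None)
        self.blocks = []   # every candidate container seen so far

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        block = Block(tag) if tag in BLOCK_TAGS else None
        if block is not None:
            self.blocks.append(block)
        self.stack.append((tag, block))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray closers.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = " ".join(data.split())
        if not text:
            return
        open_tags = [tag for tag, _ in self.stack]
        if any(tag in SKIP_TAGS for tag in open_tags):
            return
        # Credit the text to the nearest enclosing candidate container.
        owner = next((b for _, b in reversed(self.stack) if b is not None), None)
        if owner is None:
            return
        owner.chunks.append(text)
        if "a" in open_tags:
            owner.link_chars += len(text)
        else:
            owner.text_chars += len(text)


def extract_main_text(html):
    """Return the text of the best-scoring block, or "" if none qualifies."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    best = max(parser.blocks, key=Block.score, default=None)
    if best is None or best.score() <= 0:
        return ""
    return " ".join(best.chunks)
```

Calling `extract_main_text(page_html)` on a page whose article sits in one `div` next to link-only navigation and footer blocks should return the article text. Because the parser only sees the HTML as delivered, it has the same limitation noted earlier: content that JavaScript injects after load is invisible to it.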

It's a crucial function for VanillaBrain's MVP, so I'm quite anxious these days because our funds keep running down.

But what can I say? I'll just keep doing my best :)


Thursday, May 21, 2015
