KidsCrawler

24 Jul 2007

Welcome to the KidsCrawler BackPackit page. KidsCrawler is a student project being developed for a graduate web mining course (B659) at Indiana University in Bloomington, IN. The goal of the project is to create a topical web crawler that gathers a kid-accessible subset of the World Wide Web. Our project will be distinct from existing kid-oriented indexes and search engines because, in addition to the standard filter to exclude inappropriate sites, we will be paying careful attention to the accessibility of the sites we gather. We would like young children to actually be able to read and understand all of the websites in our database.

friday 2/3: Design choices. 3 Feb 2006

Today we drew up a basic flowchart for our crawler.

We decided to start out using a ‘best-first’ crawling algorithm, where a URL’s ranking is determined by two factors: the readability of its parent page, and the ‘kid-relatedness’ of its parent page and its anchor text. Pages which contain “bad words” will be discarded entirely, along with their URLs.
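A rough sketch of what a best-first frontier could look like in Java. The class and method names are made up for illustration, and combining the two factors with equal weights is only a placeholder assumption; the real weighting still has to be worked out.

```java
import java.util.Comparator;
import java.util.HashSet;
import java.util.PriorityQueue;
import java.util.Set;

// Hypothetical best-first frontier: the highest-scoring URL is crawled first.
final class BestFirstFrontier {
    // One queued URL plus the score it inherited from its parent page.
    static final class ScoredUrl {
        final String url;
        final double score;
        ScoredUrl(String url, double score) { this.url = url; this.score = score; }
    }

    // Max-heap on score (the comparator is reversed so poll() returns the best URL).
    private final PriorityQueue<ScoredUrl> queue = new PriorityQueue<>(
        Comparator.comparingDouble((ScoredUrl s) -> s.score).reversed());
    private final Set<String> seen = new HashSet<>();

    // ASSUMPTION: both inputs are in [0, 1] and are weighted equally.
    static double score(double parentReadability, double kidRelatedness) {
        return 0.5 * parentReadability + 0.5 * kidRelatedness;
    }

    // Queues a URL unless it has been queued before; returns true if it was added.
    boolean add(String url, double parentReadability, double kidRelatedness) {
        if (!seen.add(url)) return false;
        queue.add(new ScoredUrl(url, score(parentReadability, kidRelatedness)));
        return true;
    }

    // Returns the best remaining URL, or null when the frontier is empty.
    String next() {
        ScoredUrl best = queue.poll();
        return best == null ? null : best.url;
    }
}
```

The `seen` set keeps a URL from being queued twice when several parent pages link to it; whether a later, better-ranked parent should be allowed to raise the score is left open here.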

We also decided on a page parsing/evaluation procedure. Once a crawler has retrieved a page, the page will be parsed and stemmed, and its URLs will be stripped out and stored along with their anchor text. Then the page will be evaluated as follows:

1) Does the page contain controversial or offensive words?
Yes – discard page.
No – move to step 2.

2) Calculate the readability index based on several sample sections of text, using classic educational algorithms. 5th-grade level or below?
No – move to step 3a
Yes – move to step 3b

3a) Evaluate the ‘kid-relatedness’ of the page using keywords and other indicators. Is it likely to contain URLs which are more suitable? Assign an appropriate ranking to its URLs and add them to the New URL queue. Discard the page.

3b) Save the page (or its link) to our database, along with its readability ranking. Assign its URLs a ranking relative to its own readability index, and add them to the New URL queue.

4) When the New URL queue is full, sort it, stop the crawler, and merge this queue with the best-first sorted URL frontier.

5) Continue.
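Steps 1 through 3 boil down to a three-way decision per page, which might be sketched in Java like this. Everything named here (the class, the enum, whole-word matching for bad words) is an assumption for illustration, and the grade level is assumed to be computed elsewhere by a readability metric. Steps 4 and 5 (merging the New URL queue into the frontier) are left out.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of the per-page decision in steps 1-3.
final class PageEvaluator {
    enum Decision { DISCARD, QUEUE_LINKS_ONLY, SAVE_AND_QUEUE_LINKS }

    // Step 2 threshold: 5th-grade level or below.
    private static final double MAX_GRADE = 5.0;

    private final Set<String> badWords; // assumed to be stored in lowercase

    PageEvaluator(Set<String> badWords) { this.badWords = badWords; }

    // Step 1: case-insensitive, whole-word match against the bad-word list.
    boolean containsBadWord(String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z']+")) {
            if (badWords.contains(token)) return true;
        }
        return false;
    }

    // gradeLevel is assumed to come from one of the readability metrics.
    Decision evaluate(String text, double gradeLevel) {
        if (containsBadWord(text)) return Decision.DISCARD;                 // step 1
        if (gradeLevel <= MAX_GRADE) return Decision.SAVE_AND_QUEUE_LINKS;  // step 3b
        return Decision.QUEUE_LINKS_ONLY;                                   // step 3a
    }
}
```

Matching whole words only is one possible choice; it avoids flagging a harmless word that merely contains a bad word, but it will miss deliberate misspellings.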

friday 2/3: To Start With 3 Feb 2006

To start with, we will implement only the readability and ‘bad words’ modules. We will then test these on websites gathered by one of Fil’s simple open source crawlers. The object is to determine whether or not our criteria are successful at selecting only “kid-friendly” pages.

Work/Questions:

-Stemming (To stem or Not to stem? To stem very carefully? How to stem without affecting accuracy of readability indexes?)

-Bad word list. (Breast of chicken? or Breast of chick? What words will most effectively and efficiently select for the pages we actually want to eliminate?)

-Readability Algorithm. (Which one do we want to use? How complex? Will ‘big kid’ words count less negatively than high school or college words? Which one will be most robust against the poor grammar and punctuation of the internet?)

-Design Skeleton. (Start creating the necessary structure in Java to implement all this. Make it play nice with Fil’s example crawler. Figure out what our storage resources are, and if we want to be storing links, text, or whole pages).

-Evaluation. (Find some interesting/difficult pages in several categories: hard, easy, proper, improper, etc… and see how well our system ranks and classifies them).

Watch Out 3 Feb 2006

Big Standing Questions:

1) How and where do we want to store what we’re collecting?

2) How do we want to formally evaluate how well we did, once we really start doing things?

3) What sort of user interface do we want to have on the finished product? How will we implement it? (index? search engine?)

4) What is the structure of the little kid’s web? How buried is it under flashy web graphics or sites aimed at parents and teachers? Do we have enough here to really be able to find a decent percentage of what’s out there?

friday 2/10: Progress 10 Feb 2006

Design Choices:

-consider aiming for K-2 more than 1st-3rd grade; there seems to be a big leap in reading ability at age 8.

-no stemmer… it screws with the readability ratings. We don’t want “studious” = “study”. We’ve found a dictionary, meant for determining readability, which includes all readable variants of its root words.

-we’ll be using Pant’s javacrawlers as our skeleton crawler.

-we’re considering saving only the URLs of the sites we find, and using “recall” as an evaluation technique. There are a few major, high quality indexes of kids’ sites out there that are created by collecting and evaluating URL suggestions from users. We may be able to use these to create a ‘desired sites’ list to test our crawler’s coverage of the web. (Of course, if our crawler stumbles on these indexes as it’s crawling, it will score 100%, so… we may need to refine this.)
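For the recall idea, the computation itself is small. A minimal Java sketch, assuming both lists are sets of URLs that have already been normalized the same way (the class and method names are made up):

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical recall check against a hand-built 'desired sites' list.
final class RecallEvaluator {
    // Fraction of desired URLs that the crawler actually found.
    // Returns 0.0 when the desired list is empty.
    static double recall(Set<String> desired, Set<String> crawled) {
        if (desired.isEmpty()) return 0.0;
        Set<String> found = new HashSet<>(desired);
        found.retainAll(crawled); // keep only desired URLs that were crawled
        return (double) found.size() / desired.size();
    }
}
```

This does nothing about the problem noted above: if the crawler reaches the desired sites only by way of the index they came from, the score says little about the crawler itself.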



Crawler Issues:

right now it looks like we’ll be focusing our efforts on 3 questions:

1) readability metric. Which of the readability metrics we used will adapt best to the web? We’ll want an objective way to compare and contrast them… a sampling of online children’s books and sites written by kids for kids may help with this.

2) graphics. a lot of kids-sites are solid graphics… we’ll need to parse what we can and find some effective and efficient way of dealing with the rest, maybe partly by hand. we don’t want to neglect every kid’s site that’s written in flash.

3) ‘kid-related’... because of the organic construction of current kids directories (they are almost all done by hand, and the best and most complete seem to be created by volunteers and/or the site’s patrons themselves), it will do us a lot of good to be able to identify and put high priority on sites that contain indexes of small-children sites. this point might be a good place to make use of machine learning techniques, and might be something to continue working on after this semester.



Progress: a code skeleton in Java, a dictionary located and partly encoded.

ToDo: write one or several readability metrics, finish encoding dictionary.

Final Crawler Update 20 Apr 2006

We have successfully implemented both a best-first and a breadth-first crawler. Each crawler uses three separate reading metrics to evaluate the readability level of a page’s contents. The first two metrics are the Dale-Chall and Harris-Jacobson formulas, while the third is the Fry reading formula. Of the three, the Fry is the simplest: it only counts the average number of syllables per sentence.
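To make the syllables-per-sentence count concrete, here is a small Java sketch. It is not the project code: the class name is made up, and the syllable counter is a crude vowel-group heuristic chosen for illustration, so it will miscount plenty of English words.

```java
import java.util.Locale;

// Hypothetical sketch: average syllables per sentence with a heuristic counter.
final class SyllableStats {
    // ASSUMPTION: a syllable is a run of vowels (a, e, i, o, u, y);
    // a trailing silent 'e' is dropped unless the word ends in "le".
    static int countSyllables(String word) {
        String w = word.toLowerCase(Locale.ROOT).replaceAll("[^a-z]", "");
        if (w.isEmpty()) return 0;
        int count = 0;
        boolean prevVowel = false;
        for (int i = 0; i < w.length(); i++) {
            boolean vowel = "aeiouy".indexOf(w.charAt(i)) >= 0;
            if (vowel && !prevVowel) count++;
            prevVowel = vowel;
        }
        if (w.endsWith("e") && !w.endsWith("le") && count > 1) count--;
        return Math.max(count, 1);
    }

    // Splits on '.', '!' and '?', then averages syllables over non-empty sentences.
    static double syllablesPerSentence(String text) {
        int sentences = 0;
        int syllables = 0;
        for (String s : text.split("[.!?]+")) {
            String trimmed = s.trim();
            if (trimmed.isEmpty()) continue;
            sentences++;
            for (String word : trimmed.split("\\s+")) {
                syllables += countSyllables(word);
            }
        }
        return sentences == 0 ? 0.0 : (double) syllables / sentences;
    }
}
```

Splitting sentences on punctuation alone is fragile on web text (abbreviations, missing periods, navigation menus), which is the robustness concern raised in the earlier notes.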

The crawlers were tested by allowing them to crawl 15,000 pages from a set of eight seed pages. The crawlers will be evaluated using several PageRank metrics. The first PageRank metric used is that described in the Brin and Page paper “The Anatomy of a Large-Scale Hypertextual Web Search Engine.” Since we are using readability metrics to classify pages, we also implemented two topic-sensitive PageRank models described in “Web Page Scoring Systems for Horizontal and Vertical Search” by Diligenti, Gori, and Maggini. The first PageRank model described in this paper assumes that a surfer follows the links on a page according to the suggestions of some page classification method (e.g. readability metrics). The second PageRank model, called Double Focused PageRank, assumes that the choices of a surfer are based on the contents of the current page, and that when a surfer randomly jumps to another page, he/she will jump to another page with related content.
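For reference, the basic random-surfer PageRank can be sketched as a simple power iteration in Java. This is an illustration, not our evaluation code: the class name and the adjacency-array representation are assumptions, it uses the normalized form of the formula (scores sum to 1) rather than the exact form printed in the Brin and Page paper, and it spreads the rank of pages with no out-links evenly over all pages. The topic-sensitive variants are not shown.

```java
import java.util.Arrays;

// Hypothetical sketch of plain PageRank by power iteration.
final class SimplePageRank {
    // links[i] holds the indices of the pages that page i links to.
    static double[] compute(int[][] links, double damping, int iterations) {
        int n = links.length;
        double[] rank = new double[n];
        if (n == 0) return rank;
        Arrays.fill(rank, 1.0 / n); // start from a uniform distribution
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            double dangling = 0.0; // total rank held by pages without out-links
            for (int i = 0; i < n; i++) {
                if (links[i].length == 0) {
                    dangling += rank[i];
                    continue;
                }
                double share = rank[i] / links[i].length;
                for (int target : links[i]) next[target] += share;
            }
            for (int i = 0; i < n; i++) {
                // random jump + followed links + evenly spread dangling rank
                next[i] = (1.0 - damping) / n + damping * (next[i] + dangling / n);
            }
            rank = next;
        }
        return rank;
    }
}
```

A damping factor of 0.85 is the value commonly used; a few dozen iterations are usually enough for the ordering to settle on small graphs.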

Finally, the Diligenti paper also describes a model for combining PageRank and HITS into a stable algorithm that can take into account the more complex relationships between different pages (i.e. hubs and authorities). This model was implemented to give us hub and authority scores for our crawled sites.
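As background for the hub and authority scores, here is a Java sketch of plain HITS. To be clear, this is the standard algorithm, not the combined PageRank/HITS model from the Diligenti paper; the class name and graph representation are assumptions for illustration.

```java
import java.util.Arrays;

// Hypothetical sketch of standard HITS (not the combined model from the paper).
final class SimpleHits {
    // links[i] holds the indices of the pages that page i links to.
    // Returns { hubScores, authorityScores }, each scaled to unit length.
    static double[][] compute(int[][] links, int iterations) {
        int n = links.length;
        double[] hub = new double[n];
        double[] auth = new double[n];
        Arrays.fill(hub, 1.0);
        for (int it = 0; it < iterations; it++) {
            // Authority of a page: sum of the hub scores of pages linking to it.
            double[] newAuth = new double[n];
            for (int i = 0; i < n; i++)
                for (int t : links[i]) newAuth[t] += hub[i];
            normalize(newAuth);
            // Hub score of a page: sum of the authority scores of pages it links to.
            double[] newHub = new double[n];
            for (int i = 0; i < n; i++)
                for (int t : links[i]) newHub[i] += newAuth[t];
            normalize(newHub);
            auth = newAuth;
            hub = newHub;
        }
        return new double[][] { hub, auth };
    }

    // Scales the vector to Euclidean length 1; leaves an all-zero vector alone.
    private static void normalize(double[] v) {
        double sumSquares = 0.0;
        for (double x : v) sumSquares += x * x;
        double norm = Math.sqrt(sumSquares);
        if (norm == 0.0) return;
        for (int i = 0; i < v.length; i++) v[i] /= norm;
    }
}
```

In this picture a hand-built index of kids’ sites should show up as a strong hub, and the sites it points to as authorities.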

With these different metrics we hope to confidently assess the performance of our crawlers.

Link to Project Source Code 26 Apr 2006

Here’s a link to our project source code. It has been bundled up in a tgz file.

http://cs.indiana.edu/~jhpowell/kidcrawlercode.tgz