proj4

proj4 - Project#4 !CHEATING!

!CHEATING! Project #4

In Project #4, you’re going to build an internet plagiarism detector!
Project #4

Your program will be given an essay like this:

Deoxyribonucleic acid or DNA is a nucleic acid that contains the genetic instructions used in the development and functioning of all known living organisms and some viruses. The main role of DNA molecules is the long-term storage of information. DNA is often compared to a set of blueprints or a recipe, or a code, since it contains the instructions needed to construct other components of cells, such as proteins and RNA molecules. The DNA segments that carry this genetic information are called genes, but other DNA sequences have structural purposes, or are involved in regulating the use of this genetic information.

And a starting website URL like this:

http://en.wikipedia.org/wiki/genetics

And it has to find all web pages (including those linked from the starting webpage) that have similar text to the provided essay.
Running your program:

You’ll run your program like this:

C:\CS32> proj4.exe essay.txt http://wikipedia.org/genetics 3 10 5

The 1st parameter (essay.txt) is a text file that contains the student’s essay.

The 2nd parameter (http://wikipedia.org/genetics) specifies the starting webpage where you’re to search for similar text.

The 3rd parameter (3) indicates that you should visit all pages reachable within three links of the start page.

The 4th parameter (10) indicates that you should only extract and follow the first ten links per web page.

The last parameter (5) means that a webpage is a match with the essay if at least five sets of terms are found to be in common.
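
To make the 3rd and 4th parameters concrete, here is a minimal sketch of a depth-limited, breadth-first crawl. It is not the project's required design. The `LinkGraph` type and the `crawl` function are hypothetical names, and the "web" is an in-memory map standing in for real page downloads. The sketch also assumes that the first links on a page count toward the per-page limit even if they point to pages already seen.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <queue>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the web: maps a URL to the links found on that
// page, in the order they appear. A real crawler would download each page.
using LinkGraph = std::map<std::string, std::vector<std::string>>;

// Returns, in visit order, every page reachable from `start` within
// `maxDepth` links, following at most `maxLinksPerPage` links per page.
std::vector<std::string> crawl(const LinkGraph& web, const std::string& start,
                               int maxDepth, std::size_t maxLinksPerPage)
{
    std::vector<std::string> visited;
    std::set<std::string> seen;
    seen.insert(start);

    std::queue<std::pair<std::string, int>> frontier;  // (url, depth)
    frontier.push(std::make_pair(start, 0));

    while (!frontier.empty()) {
        std::pair<std::string, int> current = frontier.front();
        frontier.pop();
        visited.push_back(current.first);

        // Pages at the depth limit are searched but their links are not followed.
        if (current.second == maxDepth)
            continue;

        LinkGraph::const_iterator it = web.find(current.first);
        if (it == web.end())
            continue;  // page has no known links

        const std::vector<std::string>& links = it->second;
        std::size_t limit = std::min(links.size(), maxLinksPerPage);
        for (std::size_t i = 0; i < limit; ++i) {
            // Enqueue each page only once, the first time it is discovered.
            if (seen.insert(links[i]).second)
                frontier.push(std::make_pair(links[i], current.second + 1));
        }
    }
    return visited;
}
```

With the sample command line above, the call would look like `crawl(web, startUrl, 3, 10)`. Breadth-first order is a design choice here: it guarantees each page is reached by its shortest link path, so the depth limit is applied consistently.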
Running your program:

You’ll run your program like this:

C:\CS32> proj4.exe essay.txt http://wikipedia.org/genetics 3 10 5

And here’s what your program might print out:

Searching for matches...
Searched through 285 total web pages.
There were 9 pages with at least 5 hits:
www.wikipedia.org/wiki/RNA 100
www.wikipedia.org/wiki/DNA 14
www.wikipedia.org/wiki/Adenosine_triphosphate 8
www.wikipedia.org/wiki/Cancer 8
www.wikipedia.org/wiki/Genetics 6
www.wikipedia.org/wiki/Double_helix 5
www.wikipedia.org/wiki/Eye_color 5
www.wikipedia.org/wiki/Wikipedia:Requests_for_adminship 5
www.wikipedia.org/wiki/Wikipedia_talk:Reliable_sources 5
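
A report in this shape could be produced roughly as follows. This is only a sketch: `PageHits` and `formatReport` are hypothetical names, and the hit counts are assumed to have been computed already. Sorting by descending hit count with ties broken by URL is an assumption; it happens to agree with the sample output, but the handout does not state the ordering rule.

```cpp
#include <algorithm>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record: a visited page and how many hits it had with the essay.
struct PageHits {
    std::string url;
    int hits;
};

// Builds the summary text for all pages with at least `minHits` hits.
std::string formatReport(std::vector<PageHits> pages, int totalSearched,
                         int minHits)
{
    // Drop pages that fall below the match threshold.
    pages.erase(std::remove_if(pages.begin(), pages.end(),
                               [minHits](const PageHits& p) {
                                   return p.hits < minHits;
                               }),
                pages.end());

    // Assumed ordering: most hits first, ties broken by URL.
    std::sort(pages.begin(), pages.end(),
              [](const PageHits& a, const PageHits& b) {
                  if (a.hits != b.hits)
                      return a.hits > b.hits;
                  return a.url < b.url;
              });

    std::ostringstream out;
    out << "Searched through " << totalSearched << " total web pages.\n";
    out << "There were " << pages.size() << " pages with at least "
        << minHits << " hits:\n";
    for (std::size_t i = 0; i < pages.size(); ++i)
        out << pages[i].url << " " << pages[i].hits << "\n";
    return out.str();
}
```

The function takes its vector by value so the caller's list of pages is left unmodified by the filtering and sorting.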
Our program really “Crawls” the web? Yes – your program will connect to the internet and