Classification

IFT6255: Information Retrieval
Text classification

Overview
• Definition of text classification
• Important processes in classification
• Classification algorithms
• Advantages and disadvantages of algorithms
• Performance comparison of algorithms
• Conclusion

Text Classification
• Text classification (text categorization): assign documents to one or more predefined categories
[Diagram: Documents → ? → classes: class1, class2, ..., classn]

Illustration of Text Classification
[Figure: documents sorted into the classes Science, Sport, Art]

Applications of Text Classification
• Organize web pages into hierarchies
• Domain-specific information extraction
• Sort email into different folders
• Find interests of users
• Etc.

Text Classification Framework
Documents → Preprocessing → Indexing → Feature selection → Applying classification algorithms → Performance measure

Preprocessing
• Preprocessing: transform documents into a representation suitable for the classification task
– Remove HTML or other tags
– Remove stopwords
– Perform word stemming (remove suffixes)

Indexing
• Indexing by different weighting schemes:
– Boolean weighting
– Word frequency weighting
– tf*idf weighting
– ltc weighting
– Entropy weighting

Feature Selection
• Feature selection: remove non-informative terms from documents
=> improve classification effectiveness
=> reduce computational complexity

Different Feature Selection Methods
• Document Frequency Thresholding (DF)
– tf > threshold
– tf*idf
• Information Gain (IG)

$$IG(w) = -\sum_{j=1}^{K} P(c_j)\log P(c_j) + P(w)\sum_{j=1}^{K} P(c_j \mid w)\log P(c_j \mid w) + P(\bar{w})\sum_{j=1}^{K} P(c_j \mid \bar{w})\log P(c_j \mid \bar{w})$$
$$= H(\text{samples}) - H(\text{samples} \mid w)$$

Different Feature Selection Methods
• χ²-statistic (CHI), computed from the document counts:
– A: w and C_j
– B: w and not C_j
– C: not w and C_j
– D: not w and not C_j
• Mutual Information (MI)

$$MI(w, C_j) = \log \frac{P(w, C_j)}{P(w)\,P(C_j)} \approx \ldots$$
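The preprocessing and tf*idf indexing steps listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stopword list, the suffix-stripping rule (a crude stand-in for a real stemmer), the function names, and the idf form log(N / df) are choices made here, not taken from the slides.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; real systems use a longer one.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}

def strip_suffix(token):
    """Rough stand-in for a stemmer (not Porter): drop one common suffix."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Remove tags, lowercase, tokenize, drop stopwords, strip suffixes."""
    text = re.sub(r"<[^>]+>", " ", text)           # remove HTML-like tags
    tokens = re.findall(r"[a-z]+", text.lower())   # keep alphabetic tokens
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [strip_suffix(t) for t in tokens]

def tfidf(docs):
    """Return one {term: tf * log(N / df)} dict per tokenized document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {term: tf * math.log(n / df[term]) for term, tf in Counter(doc).items()}
        for doc in docs
    ]

docs = [preprocess(d) for d in (
    "<p>The team is winning the game</p>",
    "<p>The painting of the game</p>",
)]
weights = tfidf(docs)
```

A term that occurs in every document, such as "game" here, gets weight 0 under this idf form, which is why tf*idf favors terms that discriminate between documents.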
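As a rough illustration of the feature selection scores above, the sketch below computes information gain as H(samples) - H(samples | w) and a χ² score from the counts A, B, C, D. The χ² formula used, N(AD - CB)² / ((A+C)(B+D)(A+B)(C+D)), is the commonly used form and is an assumption here, since the formula itself is not recoverable from the slide text. The toy documents, labels, and function names are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy in bits, skipping zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(docs, labels, term):
    """IG(term) = H(classes) - H(classes | term present or absent)."""
    classes = sorted(set(labels))
    n = len(docs)

    def class_dist(subset):
        return [sum(1 for lab in subset if lab == c) / len(subset) for c in classes]

    with_term = [lab for doc, lab in zip(docs, labels) if term in doc]
    without_term = [lab for doc, lab in zip(docs, labels) if term not in doc]
    gain = entropy(class_dist(labels))
    for subset in (with_term, without_term):
        if subset:  # an empty split contributes nothing
            gain -= len(subset) / n * entropy(class_dist(subset))
    return gain

def chi_square(docs, labels, term, cls):
    """Chi-square score from the contingency counts A, B, C, D (assumed form)."""
    pairs = list(zip(docs, labels))
    a = sum(1 for doc, lab in pairs if term in doc and lab == cls)
    b = sum(1 for doc, lab in pairs if term in doc and lab != cls)
    c = sum(1 for doc, lab in pairs if term not in doc and lab == cls)
    d = sum(1 for doc, lab in pairs if term not in doc and lab != cls)
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0

# Toy corpus: each document is a set of terms.
toy_docs = [{"goal", "match"}, {"goal", "team"},
            {"paint", "museum"}, {"paint", "canvas"}]
toy_labels = ["sport", "sport", "art", "art"]

ig_goal = information_gain(toy_docs, toy_labels, "goal")
ig_match = information_gain(toy_docs, toy_labels, "match")
chi_goal = chi_square(toy_docs, toy_labels, "goal", "sport")
```

In this toy corpus "goal" separates the two classes perfectly, so it scores higher than "match", which appears in only one document; ranking terms by such scores and keeping the top ones is the usual way these measures are applied.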
This note was uploaded on 01/09/2012 for the course CS CS273 taught by Professor Xifengyan during the Spring '11 term at UCSB.
