

Chapter 23: Text Processing
Bjarne Stroustrup
www.stroustrup.com/Programming

Overview
  • Application domains
  • Strings
  • I/O
  • Maps
  • Regular expressions

Now you know the basics
  • Really! Congratulations!
  • Don't get stuck with a sterile focus on programming language features
  • What matters are programs, applications, what good you can do with programming
      • Text processing
      • Numeric processing
      • Embedded systems programming
      • Banking
      • Medical applications
      • Scientific visualization
      • Animation
      • Route planning
      • Physical design

Text processing
  • "all we know can be represented as text"
      • And often is
  • Books, articles
  • Transaction logs (email, phone, bank, sales, …)
  • Web pages (even the layout instructions)
  • Tables of figures (numbers)
  • Mail
  • Programs
  • Measurements
  • Historical data
  • Medical records
  • …
  Sample text shown on the slide (cut off there):
      Amendment I
      Congress shall make no law respecting an establishment of religion, or
      prohibiting the free exercise thereof; or abridging the freedom of speech,
      or of the press; or the right of the people peaceably to assemble, and to
      petition the government for a redress

String overview
  • Strings
      • std::string
          • <string>
          • s.size()
          • s1==s2
      • C-style string (zero-terminated array of char)
          • <cstring> or <string.h>
          • strlen(s)
          • strcmp(s1,s2)
      • std::basic_string<Ch>, e.g. Unicode strings
          • typedef std::basic_string<char> string;
      • Proprietary string classes

String conversion
  • Simple to_string

      template<class T> string to_string(const T& t)
      {
          ostringstream os;
          os << t;
          return os.str();
      }

  • For example:

      string s1 = to_string(12.333);
      string s2 = to_string(1+5*6-99/7);

String conversion
  • Simple extract from string

      template<class T> T from_string(const string& s)
      {
          istringstream is(s);
          T t;
          if (!(is >> t)) throw bad_from_string();
          return t;
      }

  • For example:

      double d = from_string<double>("12.333");
      Matrix<int,2> m = from_string< Matrix<int,2> >("{ {1,2}, {3,4} }");

General stream conversion

      template<typename Target, typename Source>
      Target lexical_cast(Source arg)
      {
          std::stringstream ss;
          Target result;
          if (!(ss << arg)                  // read arg into stream
              || !(ss >> result)            // read result from stream
              || !(ss >> std::ws).eof())    // stuff left in stream?
              throw bad_lexical_cast();
          return result;
      }

      string s = lexical_cast<string>(lexical_cast<double>(" 12.7 "));    // ok

      // works for any type that can be streamed into and/or out of a string:
      XX xx = lexical_cast<XX>(lexical_cast<YY>(XX(whatever)));    // !!!

I/O overview
  • Stream I/O
      in >> x          Read from in into x according to x's format
      out << x         Write x to out according to x's format
      in.get(c)        Read a character from in into c
      getline(in,s)    Read a line from in into the string s
  • Stream classes shown on the slide:
      istream          ostream          iostream
      istringstream    ostringstream    stringstream
      ifstream         ofstream         fstream

Map overview
  • Associative containers
      • <map>, <set>, <unordered_map>, <unordered_set>
      • map, multimap, set, multiset
      • unordered_map, unordered_multimap, unordered_set, unordered_multiset
  • The backbone of text manipulation
      • Find a word
      • See if you have already seen a word
      • Find information that corresponds to a word
  • See example in Chapter 23

Map overview
  [Diagram: a multimap<string,Message*> with the keys "John Doe", "John Doe",
  and "John Q. Public", alongside Mail_file: vector<Message>]

A problem: Read a ZIP code
  • U.S. state abbreviation and ZIP code
      • two letters followed by five digits

      string s;
      while (cin>>s) {
          if (s.size()==7
              && isalpha(s[0]) && isalpha(s[1])
              && isdigit(s[2]) && isdigit(s[3]) && isdigit(s[4])
              && isdigit(s[5]) && isdigit(s[6]))
              cout << "found " << s << '\n';
      }

  • Brittle, messy, unique code

A problem: Read a ZIP code
  • Problems with simple solution
      • It's verbose (4 lines, 8 function calls)
      • We miss (intentionally?) every ZIP code number not separated from its context by whitespace
          • "TX77845", TX77845-1234, and ATM77845
      • We miss (intentionally?) every ZIP code number with a space between the letters and the digits
          • TX 77845
      • We accept (intentionally?) every ZIP code number with the letters in lower case
          • tx77845
      • If we decided to look for a postal code in a different format we have to completely rewrite the code
          • CB3 0DS, DK-8000 Arhus

TX77845-1234
  • 1st try: wwddddd
  • 2nd (remember -1234): wwddddd-dddd
  • What's "special"?
  • 3rd: \w\w\d\d\d\d\d-\d\d\d\d
  • 4th (make counts explicit): \w2\d5-\d4
  • 5th (and "special"): \w{2}\d{5}-\d{4}
  • But -1234 was optional?
  • 6th: \w{2}\d{5}(-\d{4})?
  • We wanted an optional space after TX
  • 7th (invisible space): \w{2} ?\d{5}(-\d{4})?
  • 8th (make space visible): \w{2}\s?\d{5}(-\d{4})?
  • 9th (lots of space, or none): \w{2}\s*\d{5}(-\d{4})?

Regex library – availability
  • Not part of C++98 standard
  • Part of "Technical Report 1" 2004
  • Part of C++0x
  • Ships with
      • VS 9.0 C++, use <regex>, std::tr1::regex
      • GCC 4.3.0, use <tr1/regex>, std::tr1::regex
      • www.boost.org, use <boost/regex.hpp>, boost::regex

      #include <boost/regex.hpp>
      #include <iostream>
      #include <string>
      #include <fstream>
      using namespace std;

      int main()
      {
          ifstream in("file.txt");    // input file
          if (!in) cerr << "no file\n";

          boost::regex pat("\\w{2}\\s*\\d{5}(-\\d{4})?");    // ZIP code pattern
          cout << "pattern: " << pat << '\n';
          // …
      }

      int lineno = 0;
      string line;    // input buffer
      while (getline(in,line)) {
          ++lineno;
          boost::smatch matches;    // matched strings go here
          if (boost::regex_search(line, matches, pat)) {
              cout << lineno << ": " << matches[0] << '\n';    // whole match
              if (1<matches.size() && matches[1].matched)
                  cout << "\t: " << matches[1] << '\n';        // sub-match
          }
      }

Results
  Input:
      address TX77845
      ffff tx 77843 asasasaa
      ggg TX3456-23456
      howdy
      zzz TX23456-3456sss ggg TX33456-1234
      cvzcv TX77845-1234 sdsas
      xxxTx77845xxx
      TX12345-123456
  Output:
      pattern: "\w{2}\s*\d{5}(-\d{4})?"
      1: TX77845
      2: tx 77843
      5: TX23456-3456
          : -3456
      6: TX77845-1234
          : -1234
      7: Tx77845
      8: TX12345-1234
          : -1234

Regular expression syntax
  • Regular expressions have a thorough theoretical foundation based on state machines
      • You can mess with the syntax, but not much with the semantics
  • The syntax is terse, cryptic, boring, useful
      • Go learn it
  • Examples
      Xa{2,3}                     // Xaa Xaaa
      Xb{2}                       // Xbb
      Xc{2,}                      // Xcc Xccc Xcccc Xccccc …
      \w{2}-\d{4,5}               // \w is a word character (letter, digit, or underscore), \d is a digit
      (\d*:)?(\d+)                // 124:1232321 :123 123
      Subject: (FW:|Re:)?(.*)     // . (dot) matches any character
      [a-zA-Z] [a-zA-Z_0-9]*      // identifier
      [^aeiouy]                   // not an English vowel

Searching vs. matching
  • Searching for a string that matches a regular expression in an (arbitrarily long) stream of data
      • regex_search() looks for its pattern as a substring in the stream
  • Matching a regular expression against a string (of known size)
      • regex_match() looks for a complete match of its pattern and the string

Table grabbed from the web

      KLASSE          ANTAL DRENGE    ANTAL PIGER    ELEVER IALT
      0A              12              11             23
      1A              7               8              15
      1B              4               11             15
      2A              10              13             23
      3A              10              12             22
      4A              7               7              14
      4B              10              5              15
      5A              19              8              27
      6A              10              9              19
      6B              9               10             19
      7A              7               19             26
      7G              3               5              8
      7I              7               3              10
      8A              10              16             26
      9A              12              15             27
      0MO             3               2              5
      0P1             1               1              2
      0P2             0               5              5
      10B             4               4              8
      10CE            0               1              1
      1MO             8               5              13
      2CE             8               5              13
      3DCE            3               3              6
      4MO             4               1              5
      6CE             3               4              7
      8CE             4               4              8
      9CE             4               9              13
      REST            5               6              11
      Alle klasser    184             202            386

  • Numeric fields
  • Text fields
  • Invisible field separators
  • Semantic dependencies
      • i.e. the numbers actually mean something
      • In each row, first number + second number == third number
      • The last line holds the column sums

Describe rows
  • Header line
      • Regular expression: ^[\w ]+(\t[\w ]+)*$
      • As string literal: "^[\\w ]+(\t[\\w ]+)*$"
  • Other lines
      • Regular expression: ^([\w ]+)(\t\d+)(\t\d+)(\t\d+)$
      • As string literal: "^([\\w ]+)(\t\\d+)(\t\\d+)(\t\\d+)$"
  • Here \t stands for the literal tab character that separates the fields on the slide
  • Aren't those invisible tab characters annoying?
      • Define a tab character class
  • Aren't those invisible space characters annoying?
      • Use \s

Simple layout check

      int main()
      {
          ifstream in("table.txt");    // input file
          if (!in) error("no input file\n");

          string line;    // input buffer
          int lineno = 0;

          regex header("^[\\w ]+(\t[\\w ]+)*$");              // header line
          regex row("^([\\w ]+)(\t\\d+)(\t\\d+)(\t\\d+)$");    // data line

          // … check layout …
      }

Simple layout check

      int main()
      {
          // … open files, define patterns …

          if (getline(in,line)) {    // check header line
              smatch matches;
              if (!regex_match(line, matches, header)) error("no header");
          }
          while (getline(in,line)) {    // check data line
              ++lineno;
              smatch matches;
              if (!regex_match(line, matches, row))
                  error("bad line", to_string(lineno));
          }
      }

Validate table

      int boys = 0;     // column totals
      int girls = 0;

      while (getline(in,line)) {    // extract and check data
          smatch matches;
          if (!regex_match(line, matches, row)) error("bad line");

          int curr_boy = from_string<int>(matches[2]);    // check row
          int curr_girl = from_string<int>(matches[3]);
          int curr_total = from_string<int>(matches[4]);
          if (curr_boy+curr_girl != curr_total) error("bad row sum");

          if (matches[1]=="Alle klasser") {    // last line; check columns:
              if (curr_boy != boys) error("boys don't add up");
              if (curr_girl != girls) error("girls don't add up");
              return 0;
          }

          boys += curr_boy;
          girls += curr_girl;
      }

Application domains
  • Text processing is just one domain among many
      • Or even several domains (depending on how you count)
      • Browsers, Word, Acrobat, Visual Studio, …
  • Image processing
  • Sound processing
  • Data bases
      • Medical
      • Scientific
      • Commercial
      • …
  • Numerics
  • Financial
  • …