Title:
Data munging with Perl
Author:
Cross, David, 1962-
Publication Information:
Greenwich, Conn. : Manning Publications, 2001.
Physical Description:
xviii, 283 pages ; 24 cm
Language:
English
ISBN:
9781930110007
Format:
Book

Call Number:
QA76.73.P22 C39 2001
Material Type:
Adult Non-Fiction
Home Location:
Central Closed Stacks

Summary

Techniques for using Perl to recognize, parse, transform, and filter data.


Author Notes

Dave Cross is the owner and Managing Director of Magnum Solutions Ltd., an internet and database consultancy based in London. He has 12 years' experience working in the IT industry. He is an active member of the Perl community, the founder of the London Perl Mongers, and a regular columnist for Perlmonth, the online Perl magazine.


Table of Contents

Foreword p. xi
Preface p. xiii
About the cover illustration p. xviii
Part I Foundations p. 1
1 Data, data munging, and Perl p. 3
1.1 What is data munging? p. 4
Data munging processes p. 4
Data recognition p. 5
Data parsing p. 6
Data filtering p. 6
Data transformation p. 6
1.2 Why is data munging important? p. 7
Accessing corporate data repositories p. 7
Transferring data between multiple systems p. 7
Real-world data munging examples p. 8
1.3 Where does data come from? Where does it go? p. 9
Data files p. 9
Databases p. 10
Data pipes p. 11
Other sources/sinks p. 11
1.4 What forms does data take? p. 12
Unstructured data p. 12
Record-oriented data p. 13
Hierarchical data p. 13
Binary data p. 13
1.5 What is Perl? p. 14
Getting Perl p. 15
1.6 Why is Perl good for data munging? p. 16
1.7 Further information p. 17
1.8 Summary p. 17
2 General munging practices p. 18
2.1 Decouple input, munging, and output processes p. 19
2.2 Design data structures carefully p. 20
Example: the CD file revisited p. 20
2.3 Encapsulate business rules p. 25
Reasons to encapsulate business rules p. 26
Ways to encapsulate business rules p. 26
Simple module p. 27
Object class p. 28
2.4 Use UNIX "filter" model p. 31
Overview of the filter model p. 31
Advantages of the filter model p. 32
2.5 Write audit trails p. 36
What to write to an audit trail p. 36
Sample audit trail p. 37
Using the UNIX system logs p. 37
2.6 Further information p. 38
2.7 Summary p. 38
3 Useful Perl idioms p. 39
3.1 Sorting p. 40
Simple sorts p. 40
Complex sorts p. 41
The Orcish Manoeuvre p. 42
Schwartzian transform p. 43
The Guttman-Rosler transform p. 46
Choosing a sort technique p. 46
3.2 Database Interface (DBI) p. 47
Sample DBI program p. 47
3.3 Data::Dumper p. 49
3.4 Benchmarking p. 51
3.5 Command line scripts p. 53
3.6 Further information p. 55
3.7 Summary p. 56
4 Pattern matching p. 57
4.1 String handling functions p. 58
Substrings p. 58
Finding strings within strings (index and rindex) p. 59
Case transformations p. 60
4.2 Regular expressions p. 60
What are regular expressions? p. 60
Regular expression syntax p. 61
Using regular expressions p. 65
Example: translating from English to American p. 70
More examples: /etc/passwd p. 73
Taking it to extremes p. 76
4.3 Further information p. 77
4.4 Summary p. 78
Part II Data Munging p. 79
5 Unstructured data p. 81
5.1 ASCII text files p. 82
Reading the file p. 82
Text transformations p. 84
Text statistics p. 85
5.2 Data conversions p. 87
Converting the character set p. 87
Converting line endings p. 88
Converting number formats p. 90
5.3 Further information p. 94
5.4 Summary p. 95
6 Record-oriented data p. 96
6.1 Simple record-oriented data p. 97
Reading simple record-oriented data p. 97
Processing simple record-oriented data p. 100
Writing simple record-oriented data p. 102
Caching data p. 105
6.2 Comma-separated files p. 108
Anatomy of CSV data p. 108
Text::CSV_XS p. 109
6.3 Complex records p. 110
Example: a different CD file p. 111
Special values for $/ p. 113
6.4 Special problems with date fields p. 114
Built-in Perl date functions p. 114
Date::Calc p. 120
Date::Manip p. 121
Choosing between date modules p. 122
6.5 Extended example: web access logs p. 123
6.6 Further information p. 126
6.7 Summary p. 126
7 Fixed-width and binary data p. 127
7.1 Fixed-width data p. 128
Reading fixed-width data p. 128
Writing fixed-width data p. 135
7.2 Binary data p. 139
Reading PNG files p. 140
Reading and writing MP3 files p. 143
7.3 Further information p. 144
7.4 Summary p. 145
Part III Simple Data Parsing p. 147
8 Complex data formats p. 149
8.1 Complex data files p. 150
Example: metadata in the CD file p. 150
Example: reading the expanded CD file p. 152
8.2 How not to parse HTML p. 154
Removing tags from HTML p. 154
Limitations of regular expressions p. 157
8.3 Parsers p. 158
An introduction to parsers p. 158
Parsers in Perl p. 161
8.4 Further information p. 162
8.5 Summary p. 162
9 HTML p. 163
9.1 Extracting HTML data from the World Wide Web p. 164
9.2 Parsing HTML p. 165
Example: simple HTML parsing p. 165
9.3 Prebuilt HTML parsers p. 167
HTML::LinkExtor p. 167
HTML::TokeParser p. 169
HTML::TreeBuilder and HTML::Element p. 171
9.4 Extended example: getting weather forecasts p. 172
9.5 Further information p. 174
9.6 Summary p. 174
10 XML p. 175
10.1 XML overview p. 176
What's wrong with HTML? p. 176
What is XML? p. 176
10.2 Parsing XML with XML::Parser p. 178
Example: parsing weather.xml p. 178
Using XML::Parser p. 179
Other XML::Parser styles p. 181
XML::Parser handlers p. 188
10.3 XML::DOM p. 191
Example: parsing XML using XML::DOM p. 191
10.4 Specialized parsers--XML::RSS p. 193
What is RSS? p. 193
A sample RSS file p. 193
Example: creating an RSS file with XML::RSS p. 195
Example: parsing an RSS file with XML::RSS p. 196
10.5 Producing different document formats p. 197
Sample XML input file p. 197
XML document transformation script p. 198
Using the XML document transformation script p. 205
10.6 Further information p. 208
10.7 Summary p. 208
11 Building your own parsers p. 209
11.1 Introduction to Parse::RecDescent p. 210
Example: parsing simple English sentences p. 210
11.2 Returning parsed data p. 212
Example: parsing a Windows INI file p. 212
Understanding the INI file grammar p. 213
Parser actions and the item array p. 214
Example: displaying the contents of item p. 214
Returning a data structure p. 216
11.3 Another example: the CD data file p. 217
Understanding the CD grammar p. 218
Testing the CD file grammar p. 219
Adding parser actions p. 220
11.4 Other features of Parse::RecDescent p. 223
11.5 Further information p. 224
11.6 Summary p. 224
Part IV The big Picture p. 225
12 Looking back--and ahead p. 227
12.1 The usefulness of things p. 228
The usefulness of data munging p. 228
The usefulness of Perl p. 228
The usefulness of the Perl community p. 229
12.2 Things to know p. 229
Know your data p. 229
Know your tools p. 230
Know where to go for more information p. 230
Appendix A Modules reference p. 232
Appendix B Essential Perl p. 254
Index p. 273