"This book provides a unified framework of web scraping and information extraction from text data with R for the social sciences"--
List of contents
Preface xv
1 Introduction 1
1.1 Case study: World Heritage Sites in Danger 1
1.2 Some remarks on web data quality 7
1.3 Technologies for disseminating, extracting, and storing web data 9
1.3.1 Technologies for disseminating content on the Web 9
1.3.2 Technologies for information extraction from web documents 11
1.3.3 Technologies for data storage 12
1.4 Structure of the book 13
Part One A Primer on Web and Data Technologies 15
2 HTML 17
2.1 Browser presentation and source code 18
2.2 Syntax rules 19
2.2.1 Tags, elements, and attributes 20
2.2.2 Tree structure 21
2.2.3 Comments 22
2.2.4 Reserved and special characters 22
2.2.5 Document type definition 23
2.2.6 Spaces and line breaks 23
2.3 Tags and attributes 24
2.3.1 The anchor tag <a> 24
2.3.2 The metadata tag <meta> 25
2.3.3 The external reference tag <link> 26
2.3.4 Emphasizing tags <b>, <i>, <strong> 26
2.3.5 The paragraphs tag <p> 27
2.3.6 Heading tags <h1>, <h2>, <h3>, ... 27
2.3.7 Listing content with <ul>, <ol>, and <dl> 27
2.3.8 The organizational tags <div> and <span> 27
2.3.9 The <form> tag and its companions 29
2.3.10 The foreign script tag <script> 30
2.3.11 Table tags <table>, <tr>, <td>, and <th> 32
2.4 Parsing 32
2.4.1 What is parsing? 33
2.4.2 Discarding nodes 35
2.4.3 Extracting information in the building process 37
Summary 38
Further reading 38
Problems 39
3 XML and JSON 41
3.1 A short example XML document 42
3.2 XML syntax rules 43
3.2.1 Elements and attributes 44
3.2.2 XML structure 46
3.2.3 Naming and special characters 48
3.2.4 Comments and character data 49
3.2.5 XML syntax summary 50
3.3 When is an XML document well formed or valid? 51
3.4 XML extensions and technologies 53
3.4.1 Namespaces 53
3.4.2 Extensions of XML 54
3.4.3 Example: Really Simple Syndication 55
3.4.4 Example: scalable vector graphics 58
3.5 XML and R in practice 60
3.5.1 Parsing XML 60
3.5.2 Basic operations on XML documents 63
3.5.3 From XML to data frames or lists 65
3.5.4 Event-driven parsing 66
3.6 A short example JSON document 68
3.7 JSON syntax rules 69
3.8 JSON and R in practice 71
Summary 76
Further reading 76
Problems 76
4 XPath 79
4.1 XPath--a query language for web documents 80
4.2 Identifying node sets with XPath 81
4.2.1 Basic structure of an XPath query 81
4.2.2 Node relations 84
4.2.3 XPath predicates 86
4.3 Extracting node elements 93
4.3.1 Extending the fun argument 94
4.3.2 XML namespaces 96
4.3.3 Little XPath helper tools 97
Summary 98
Further reading 99
Problems 99
5 HTTP 101
5.1 HTTP fundamentals 102
5.1.1 A short conversation with a web server 102
5.1.2 URL syntax 104
5.1.3 HTTP messages 106
5.1.4 Request methods 108
5.1.5 Status codes 108
5.1.6 Header fields 109
5.2 Advanced features of HTTP 116
5.2.1 Identification 116
5.2.2 Authentication 121
5.2.3 Proxies 123
5.3 Protocols beyond HTTP 124
5.3.1 HTTP Secure 124
5.3.2 FTP 126
5.4 HTTP in action 126
5.4.1 The libcurl library 127
About the authors
Simon Munzert, Christian Rubba, Peter Meißner, and Dominic Nyhuis are the authors of Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining, published by Wiley.