Fr. 90.00

Simon Munzert, Christian Rubba, Peter Meißner, Dominic Nyhuis

Automated Data Collection with R

English · Hardback

Shipping usually within 3 to 5 weeks

Description

"This book provides a unified framework of web scraping and information extraction from text data with R for the social sciences."

List of contents

Preface

1 Introduction

1.1 Case study: World Heritage Sites in Danger

1.2 Some remarks on web data quality

1.3 Technologies for disseminating, extracting, and storing web data

1.3.1 Technologies for disseminating content on the Web

1.3.2 Technologies for information extraction from web documents

1.3.3 Technologies for data storage

1.4 Structure of the book

Part One: A Primer on Web and Data Technologies

2 HTML

2.1 Browser presentation and source code

2.2 Syntax rules

2.2.1 Tags, elements, and attributes

2.2.2 Tree structure

2.2.3 Comments

2.2.4 Reserved and special characters

2.2.5 Document type definition

2.2.6 Spaces and line breaks

2.3 Tags and attributes

2.3.1 The anchor tag <a>

2.3.2 The metadata tag <meta>

2.3.3 The external reference tag <link>

2.3.4 Emphasizing tags <b>, <i>, <strong>

2.3.5 The paragraphs tag <p>

2.3.6 Heading tags <h1>, <h2>, <h3>, ...

2.3.7 Listing content with <ul>, <ol>, and <dl>

2.3.8 The organizational tags <div> and <span>

2.3.9 The <form> tag and its companions

2.3.10 The foreign script tag <script>

2.3.11 Table tags <table>, <tr>, <td>, and <th>

2.4 Parsing

2.4.1 What is parsing?

2.4.2 Discarding nodes

2.4.3 Extracting information in the building process

Summary

Further reading

Problems

3 XML and JSON

3.1 A short example XML document

3.2 XML syntax rules

3.2.1 Elements and attributes

3.2.2 XML structure

3.2.3 Naming and special characters

3.2.4 Comments and character data

3.2.5 XML syntax summary

3.3 When is an XML document well formed or valid?

3.4 XML extensions and technologies

3.4.1 Namespaces

3.4.2 Extensions of XML

3.4.3 Example: Really Simple Syndication

3.4.4 Example: scalable vector graphics

3.5 XML and R in practice

3.5.1 Parsing XML

3.5.2 Basic operations on XML documents

3.5.3 From XML to data frames or lists

3.5.4 Event-driven parsing

3.6 A short example JSON document

3.7 JSON syntax rules

3.8 JSON and R in practice

Summary

Further reading

Problems

4 XPath

4.1 XPath--a query language for web documents

4.2 Identifying node sets with XPath

4.2.1 Basic structure of an XPath query

4.2.2 Node relations

4.2.3 XPath predicates

4.3 Extracting node elements

4.3.1 Extending the fun argument

4.3.2 XML namespaces

4.3.3 Little XPath helper tools

Summary

Further reading

Problems

5 HTTP

5.1 HTTP fundamentals

5.1.1 A short conversation with a web server

5.1.2 URL syntax

5.1.3 HTTP messages

5.1.4 Request methods

5.1.5 Status codes

5.1.6 Header fields

5.2 Advanced features of HTTP

5.2.1 Identification

5.2.2 Authentication

5.2.3 Proxies

5.3 Protocols beyond HTTP

5.3.1 HTTP Secure

5.3.2 FTP

5.4 HTTP in action

5.4.1 The libcurl library

About the author

Simon Munzert, Christian Rubba, Peter Meißner, and Dominic Nyhuis are the authors of Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining, published by Wiley.

Product details

Authors	Simon Munzert, Christian Rubba, Peter Meißner, Dominic Nyhuis
Publisher	John Wiley & Sons Ltd

Languages	English
Product format	Hardback
Released	26.12.2014

EAN	9781118834817
ISBN	978-1-118-83481-7
No. of pages	480
Dimensions	177 mm x 252 mm x 27 mm
Subjects	Natural sciences, medicine, IT, technology > Mathematics > Probability theory, stochastic theory, mathematical statistics; Non-fiction book; Statistics, Sociology, Data Mining, Sociological Research Methods, Research Methodologies, R (program), Statistical Software / R
