Data Wrangling with MongoDB - Udacity course notes

Posted by Monik, 14 December 2014.
Programming MongoDB Python Data Mining NoSQL
Course notes

These are the notes I took while taking the “Data Wrangling with MongoDB” course at Udacity. The course covers how to use Python to process CSV, XML, and Excel data, and how to work with MongoDB. It also includes some examples of page scraping in Python.

Table of contents

1. Data extraction Fundamentals

1.1. Tabular data and CSV

1.2. JSON format

2. Data in More Complex Formats

2.1. XML

2.2. Data scraping

The example is a website listing arrivals and departures at various airports. The page has two combo boxes, so collecting all the data by hand would take a lot of clicking. We would rather write a script to do it for us.
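A minimal sketch of the first step of such a script: reading the values out of the two combo boxes so that every combination can be requested in a loop later. The HTML fragment, the ids `CarrierList` and `AirportList`, and the `SelectOptionParser` class are assumptions made up for illustration, and the standard-library parser is used here only to keep the example self-contained.

```python
from html.parser import HTMLParser
from itertools import product


class SelectOptionParser(HTMLParser):
    """Collect the option values of every <select> that has an id."""

    def __init__(self):
        super().__init__()
        self.options = {}      # select id -> list of option values
        self._current = None   # id of the <select> we are currently inside

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "select" and "id" in attrs:
            self._current = attrs["id"]
            self.options[self._current] = []
        elif tag == "option" and self._current is not None:
            self.options[self._current].append(attrs.get("value"))

    def handle_endtag(self, tag):
        if tag == "select":
            self._current = None


# Hypothetical page fragment; a real script would download the page first.
SAMPLE_HTML = """
<select id="CarrierList">
  <option value="AA">American</option>
  <option value="DL">Delta</option>
</select>
<select id="AirportList">
  <option value="ATL">Atlanta</option>
  <option value="BOS">Boston</option>
</select>
"""

parser = SelectOptionParser()
parser.feed(SAMPLE_HTML)

# Every (carrier, airport) pair the script would have to request.
combinations = list(product(parser.options["CarrierList"],
                            parser.options["AirportList"]))
print(combinations)
```

Each pair in `combinations` would then become one request to the site, which is exactly the clicking we wanted to avoid.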

3. Data Quality

3.1. Various small examples

3.2. Exercise

The exercise is about analysing and cleaning a cities data set. We count how many values of each data type a field contains, decide which of the alternative areaLand values to use by choosing the more precise one, convert string arrays of city names into Python lists, and check the latitude and longitude values.
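A rough sketch of three of these cleaning steps. The `{a|b}` array notation, the sample values, and all function names are assumptions for illustration, not the exercise's actual solution.

```python
def parse_array(value):
    """Turn a string such as "{A|B}" into ["A", "B"]; wrap a plain value in a list."""
    value = value.strip()
    if value.startswith("{") and value.endswith("}"):
        return [item.strip() for item in value[1:-1].split("|")]
    return [value]


def significant_digits(number_text):
    """Roughly count the digits of the mantissa, ignoring sign, leading zeros and exponent."""
    mantissa = number_text.lower().split("e")[0].lstrip("+-0.")
    return sum(ch.isdigit() for ch in mantissa)


def most_precise(value):
    """From one or several alternative numbers, keep the one written with the most digits."""
    return float(max(parse_array(value), key=significant_digits))


def valid_location(lat, lon):
    """Check that latitude and longitude fall inside their legal ranges."""
    return -90 <= lat <= 90 and -180 <= lon <= 180


print(parse_array("{Kud|Kudh}"))
print(most_precise("{1.0e+08|1.01787e+08}"))
print(valid_location(91.0, 10.0))
```

Counting digits is only a heuristic for precision, but it is enough to prefer a value written as `1.01787e+08` over one written as `1.0e+08`.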

4. Working with MongoDB

4.1. Pymongo

4.2. Exercise

A lot of specific data cleaning and modification while copying the data from CSV to MongoDB, row by row. It’s about an arachnid (spider) data set.
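The row-by-row idea could be sketched like this. The column names, the `"NULL"` convention, and the `ListCollection` stand-in are assumptions so that the example runs without a database; with pymongo, a real collection object (which also has an `insert_one` method) would be passed in instead.

```python
import csv
import io


def clean_row(row):
    """Return a cleaned copy of one CSV row (a dict of strings)."""
    cleaned = {}
    for key, value in row.items():
        value = value.strip()
        # Assumed convention: empty strings and the literal "NULL" mean "missing".
        cleaned[key] = None if value in ("", "NULL") else value
    return cleaned


def load_rows(csv_file, collection):
    """Clean each row and pass it to collection.insert_one, one row at a time."""
    count = 0
    for row in csv.DictReader(csv_file):
        collection.insert_one(clean_row(row))
        count += 1
    return count


class ListCollection:
    """Stand-in for a pymongo collection so the sketch runs without a server."""

    def __init__(self):
        self.documents = []

    def insert_one(self, document):
        self.documents.append(document)


# Hypothetical two-row sample in place of the real CSV file.
sample = io.StringIO("label,family\n Argiope ,Araneidae\nOpisthoncana,NULL\n")
target = ListCollection()
inserted = load_rows(sample, target)
print(inserted, target.documents)
```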

5. Analysing Data

5.1. Aggregation framework in MongoDB

5.2. Exercises

Just building different pipelines.
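For reference, a pipeline in pymongo is just a list of stage dictionaries. Below is a small sketch of a typical "count and rank" pipeline; the field name `source` and both function names are hypothetical, and `run_pipeline` assumes it is given a pymongo collection, so it is defined but not called here.

```python
def top_values_pipeline(field, limit=5):
    """Build a pipeline that counts documents per value of `field`, most frequent first."""
    return [
        {"$group": {"_id": "$" + field, "count": {"$sum": 1}}},
        {"$sort": {"count": -1}},
        {"$limit": limit},
    ]


def run_pipeline(collection, pipeline):
    """Run the pipeline; `collection` is expected to be a pymongo collection."""
    return list(collection.aggregate(pipeline))


pipeline = top_values_pipeline("source")
print(pipeline)
```

Building the pipeline in a function keeps the stages easy to inspect and reuse before anything is sent to the database.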

6. Case study

It’s about the OpenStreetMap data set, which can be downloaded as XML from their site. You can download the area you are currently viewing, or the data for major cities. They also have a very nice wiki.

The data is XML with “node”s and “way”s (way = street, road, etc.). The data is human-edited, so it contains errors.

6.1. SAX XML parsing

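import xml.etree.ElementTree as ET

# xml_filename and handle_node are placeholders supplied by the caller.
# A "start" event fires as soon as an opening tag is read, so the element's
# attributes are available but its children may not be parsed yet.
for event, elem in ET.iterparse(xml_filename, events=("start",)):
  handle_node(elem)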

The non-iterative parsing (reading everything into memory at once) could look like:

tree = ET.parse(xml_filename)
root = tree.getroot()
for child in root:
  handle_node(child)
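As a small runnable illustration of the iterative approach, here is a sketch that counts how often each tag occurs while streaming through the document. The sample XML and the `count_tags` name are made up for the example; a real run would pass the path of the downloaded file instead of the in-memory bytes.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Tiny hypothetical stand-in for an OpenStreetMap export.
SAMPLE_OSM = b"""<osm>
  <node id="1" lat="41.9" lon="-87.6"/>
  <node id="2" lat="41.8" lon="-87.7">
    <tag k="amenity" v="cafe"/>
  </node>
  <way id="3">
    <nd ref="1"/>
    <nd ref="2"/>
    <tag k="highway" v="residential"/>
  </way>
</osm>"""


def count_tags(source):
    """Count element names while streaming through the XML source."""
    counts = Counter()
    # "end" events fire once an element, including its children, is complete.
    for _event, elem in ET.iterparse(source, events=("end",)):
        counts[elem.tag] += 1
        elem.clear()  # drop the element's contents once it has been processed
    return counts


tag_counts = count_tags(io.BytesIO(SAMPLE_OSM))
print(dict(tag_counts))
```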

6.2. Regular expressions

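import re

# Matches strings made only of lowercase letters and underscores.
lower = re.compile(r'^([a-z]|_)*$')
string = "addr_street"  # example input

# Note: with a capture group in the pattern, findall returns the contents
# of the group for each match, not the whole match.
re.findall(lower, string)

# search returns a match object, or None if nothing matches.
m = lower.search(string)
if m:
  substring = m.group()  # the whole matched text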
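One way such patterns could be used on the map data is to sort tag keys into categories before deciding how to clean them. This is only a sketch: the two extra patterns, the category names, and the `key_type` function are assumptions for illustration.

```python
import re

LOWER = re.compile(r'^([a-z]|_)*$')
# Two lowercase parts separated by a single colon, e.g. "addr:street".
LOWER_COLON = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
# Any character that is likely to cause trouble in a key.
PROBLEM_CHARS = re.compile(r"""[=+/&<>;'"?%#$@,. \t\r\n]""")


def key_type(key):
    """Classify a key as "problemchars", "lower", "lower_colon" or "other"."""
    if PROBLEM_CHARS.search(key):
        return "problemchars"
    if LOWER.search(key):
        return "lower"
    if LOWER_COLON.search(key):
        return "lower_colon"
    return "other"


for sample_key in ("highway", "addr:street", "addr street", "Name"):
    print(sample_key, key_type(sample_key))
```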

6.3. Exercise

It’s about parsing an XML document, iterating over its nodes, and creating proper Python dictionaries (as specified in the task description).
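A minimal sketch of what turning an element into a dictionary might look like. The dictionary layout (`type`, `id`, `pos`, `tags`, `node_refs`) and the `shape_element` name are assumptions for illustration; the real layout is whatever the task description specifies.

```python
import xml.etree.ElementTree as ET


def shape_element(element):
    """Convert a <node> or <way> element into a dictionary, or return None."""
    if element.tag not in ("node", "way"):
        return None
    doc = {"type": element.tag, "id": element.get("id")}
    if "lat" in element.attrib and "lon" in element.attrib:
        doc["pos"] = [float(element.get("lat")), float(element.get("lon"))]
    # Child <tag k="..." v="..."/> elements become plain key/value pairs.
    tags = {tag.get("k"): tag.get("v") for tag in element.iter("tag")}
    if tags:
        doc["tags"] = tags
    # A way lists the nodes it passes through as <nd ref="..."/> children.
    refs = [nd.get("ref") for nd in element.iter("nd")]
    if refs:
        doc["node_refs"] = refs
    return doc


# Hypothetical sample element.
node = ET.fromstring(
    '<node id="2" lat="41.8" lon="-87.7"><tag k="amenity" v="cafe"/></node>'
)
shaped = shape_element(node)
print(shaped)
```

Dictionaries shaped like this can be inserted into MongoDB directly, which is what makes the XML-to-dictionary step worth doing first.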

Conclusions

