Data Wrangling with MongoDB - Udacity course notes

Posted by Monik, 14 December 2014.
Programming MongoDB Python Data Mining NoSQL
Course notes

These are the notes I took while taking the “Data Wrangling with MongoDB” course at Udacity. They cover how to use Python to process CSV, XML, and Excel data, and how to work with MongoDB. There are also some examples of page scraping in Python.

1. Data Extraction Fundamentals

1.1. Tabular data and CSV

1.2. JSON format

2. Data in More Complex Formats

2.1. XML

2.2. Data scraping

The example is a website about arrivals and departures at various airports. There are two combo boxes, so we would have to click a lot to get all the data. Instead, we would rather write a script that does it for us.
  • first we view the page source to find all the codes in the combo-box lists
  • then we open Firebug to see exactly which calls are made, and which parts depend on the combo-box clicks
  • we try to identify the other parameters (e.g. something like "__VIEWSTATE", etc.)
  • we write a script which makes the calls one by one and generates a set of static HTML pages
  • ! important - first store the data, then analyse it ! - best practice
  • Beautiful Soup - from bs4 import BeautifulSoup - a module for extracting the values of fields from HTML, especially suited to this kind of task
  • with open(page, "r") as html:
            soup = BeautifulSoup(html, "html.parser")
            # the hidden ASP.NET form fields needed to make a valid POST
            data["eventvalidation"] = soup.find(id="__EVENTVALIDATION")['value']
            data["viewstate"] = soup.find(id="__VIEWSTATE")['value']
    
  • it turned out that we also have to maintain a session
  • ~~requests.post("http://..", data={...})~~
    s = requests.Session()  # a Session keeps cookies across requests
    s.post("http://..", data={...})
    
  • then we write another script (or scripts) that iterates through the downloaded files, cleans up the data (= audits it), and does what we need to do - this is actually covered in several practical exercises; a sketch of the whole download loop follows below
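Putting the pieces together, a minimal sketch of the download loop. The URL, the form field names AirportList and CarrierList, and the file naming are my assumptions for illustration, not the course's actual values:

import requests
from bs4 import BeautifulSoup

URL = "http://www.example.com/flights.aspx"  # hypothetical page

def extract_form_state(html_text):
    # pull out the hidden ASP.NET state fields needed for a valid POST
    soup = BeautifulSoup(html_text, "html.parser")
    return {"__EVENTVALIDATION": soup.find(id="__EVENTVALIDATION")["value"],
            "__VIEWSTATE": soup.find(id="__VIEWSTATE")["value"]}

def download_all(airports, carriers):
    s = requests.Session()  # keep cookies between the GET and the POSTs
    for airport in airports:
        for carrier in carriers:
            state = extract_form_state(s.get(URL).text)
            state.update({"AirportList": airport, "CarrierList": carrier})
            response = s.post(URL, data=state)
            # best practice: store the raw pages first, analyse later
            with open("{}-{}.html".format(airport, carrier), "w") as f:
                f.write(response.text)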

3. Data Quality

3.1. Various small examples

3.2. Exercise

The exercise is about analysing and cleaning a cities data set. We count how many values of each data type a field contains, decide which of two areaLand values we are more likely to want (choosing the more precise one), change a string-encoded array of city names into a Python list, and sanity-check the lat and lon locations.
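As one example, turning a string-encoded array of names into a Python list might look like this (the curly-brace, pipe-separated format is my assumption; the exercise defines the real one):

def string_to_list(value):
    # e.g. "{Munich|Muenchen}" -> ["Munich", "Muenchen"]
    if value.startswith("{") and value.endswith("}"):
        return value[1:-1].split("|")
    return value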

4. Working with MongoDB

4.1. Pymongo

  • the Python driver for MongoDB - it keeps a connection to the database
  • minimal example
  • from pymongo import MongoClient
    
    client = MongoClient('localhost:27017')
    db = client["examplesdb"]
    query = {"manufacturer" : "Porsche"}
    autos = db.autos.find(query)         # a cursor over all matching documents
    one_auto = db.autos.find_one(query)  # just the first matching document
    
  • projection
  • db.autos.find(query, {"_id":0, "name":1})  # return only the name field
    
    • normally _id is included by default, unless explicitly excluded as above
  • insert
  • db.autos.insert(auto)  # in modern PyMongo: insert_one(auto)
    
  • mongoimport command-line utility - imports whole JSON files into the DB
  • >mongoimport -d examplesdb -c autos --file autos.json

  • operators - start with "$"
  • $gt, $lt, $gte, $lte, $ne
  • query = {"foundingDate" : {"$gte":datetime(2001,1,1), "$lte" : datetime(2099,12,31)}}
    
  • $exists
  • query = {"governmentType":{"$exists" : 1}}
    
  • $regex
  • queries also match inside array fields
  • and they can match against arrays we supply, using $in and $all
  • $in
  • query = {"assembly":{"$in":["Germany","United Kingdom","Japan"]}, "manufacturer":"Ford Motor Company"}
    
  • $all
  • query = {"modelYears":{"$all":[1965, 1966, 1967]}}
    
  • can also access hierarchy
  • query = {"dimensions.width":{"$gt":2.5}}
    
  • save(obj) method - inserts, or updates when obj has an _id that already exists in the DB (see the sketch after this list)
  • update(query, doc) - for partial (and bulk) updates
  • update + $set
  • city = db.cities.update({"name":"Munich",
                             "country":"Germany"},
                            {"$set":{"isoCountryCode":"DEU"}})
    
  • update + $unset
  • city = db.cities.update({"name":"Munich",
                             "country":"Germany"},
                            {"$unset":{"isoCountryCode":
                                     "blahblah_this is ignored"}})
    
  • bulk update
  • city = db.cities.update({"name":"Munich",
                             "country":"Germany"},
                            {"$set":{"isoCountryCode":"DEU"}},
                            multi=True)
    
    
  • remove() - with no arguments, removes all documents from the collection
  • db.cities.remove()
    
  • remove(query) - removes only the matching documents
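A short sketch putting save() and remove(query) together. The population value is made up, and note that save() was removed in PyMongo 4 (use replace_one/insert_one there):

# save(): insert when there is no _id, otherwise replace the stored document
doc = db.cities.find_one({"name": "Munich", "country": "Germany"})
doc["population"] = 1450000  # made-up value
db.cities.save(doc)

# remove(query): delete only the matching documents
db.cities.remove({"country": "Atlantis"})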

4.2. Exercise

A lot of specific data cleaning and modifying, while copying the data from CSV to Mongo, row by row. It’s about an arachnid (spiders) data set.
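A minimal sketch of that row-by-row flow (the file name and field names are my assumptions; the exercise defines its own schema):

import csv
from pymongo import MongoClient

db = MongoClient('localhost:27017')["examplesdb"]

with open("arachnids.csv") as f:
    for row in csv.DictReader(f):
        row["name"] = row["name"].strip()  # an example per-row cleanup
        db.arachnids.insert(row)           # modern PyMongo: insert_one(row)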

5. Analysing Data

5.1. Aggregation framework in MongoDB

5.2. Exercises

Just building different pipelines.
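For instance, a simple pipeline over the cities collection from section 4 might group cities by country and count them (a sketch of my own, not one of the actual exercises):

import pprint
from pymongo import MongoClient

db = MongoClient('localhost:27017')["examplesdb"]

# group cities by country, count them, and sort the countries by that count
pipeline = [
    {"$group": {"_id": "$country", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
]
for doc in db.cities.aggregate(pipeline):
    pprint.pprint(doc)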

6. Case study

It’s about the OpenStreetMap data set, which can be downloaded as XML from their site. You can download the part of the map you are looking at, or download data for major cities. They also have a very nice wiki.

The data is XML with “node”s and “way”s (a way is a street, road, etc.). The data is human-edited, so it contains errors.
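A tiny hand-made fragment just to illustrate the two element types (not real OSM data, though the attribute names match the real format):

<node id="123" lat="48.13" lon="11.57">
  <tag k="name" v="Munich"/>
</node>
<way id="456">
  <nd ref="123"/>
  <tag k="highway" v="residential"/>
</way>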

6.1. SAX XML parsing

import xml.etree.ElementTree as ET

# iterative (SAX-style) parsing: elements are handed to us while the file is read
for event, elem in ET.iterparse(xml_filename, events=("start",)):
  handle_node(elem)

The non-iterative parsing (reading everything into memory at once) could go like this:

tree = ET.parse(xml_filename)
root = tree.getroot()
for child in root:
  handle_node(child)

6.2. Regular expressions

import re

lower = re.compile(r'^([a-z]|_)*$')  # only lowercase letters and underscores

# note: when the pattern contains a group, findall returns the group's
# captures, not the whole match
re.findall(lower, string)

m = lower.search(string)
if m:
  substring = m.group()  # the whole matched substring

6.3. Exercise

It’s about parsing an XML document, iterating over the XML nodes, and creating proper Python dictionaries (as specified in the task description).
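A sketch of what shaping one node into a dictionary could look like (the function name and the output keys are illustrative; the task description fixes the real ones):

def shape_element(elem):
    # turn an OSM <node> element into a plain dictionary
    if elem.tag == "node":
        node = {"id": elem.get("id"),
                "pos": [float(elem.get("lat")), float(elem.get("lon"))]}
        for tag in elem.iter("tag"):
            node[tag.get("k")] = tag.get("v")
        return node
    return None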

Conclusions

