These are the notes I took while taking the “Data Wrangling with MongoDB” course at Udacity. They cover how to use Python to process CSV, XML, and Excel files, how to work with MongoDB, and include some examples of page scraping in Python.
Table of contents
- 1. Data Extraction Fundamentals
- 2. Data in More Complex Formats
- 3. Data Quality
- 4. Working with MongoDB
- 5. Analysing Data
- 6. Case study
1. Data Extraction Fundamentals
- data wrangling = gathering, extracting, cleaning and storing data
- you have to ensure your data is correct before doing anything else with it, especially if a human was involved in entering the data
1.1 Tabular data and CSV
- items (rows) have fields (columns), and the first row contains the labels
- parsing CSV in Python - manually
- parsing XLS in Python - with the xlrd module
- parsing CSV in Python - with the csv module
- writing CSV in Python
def parse_csv_manually(datafile):
    # manual parsing: split the header line and each data line on commas;
    # create_data is a helper from the exercise that builds a dict from titles and values
    data = []
    with open(datafile, "r") as f:
        titles = f.readline().strip().split(",")
        for i, line in enumerate(f):
            if i >= 10:  # only look at the first few data rows
                break
            data.append(create_data(titles, line.strip().split(",")))
    return data
import xlrd

workbook = xlrd.open_workbook(datafile)
sheet = workbook.sheet_by_index(0)
# read the first two columns of every row into a list of lists
data = [[sheet.cell_value(r, col) for col in range(2)] for r in range(sheet.nrows)]
# start the minimum search in column 1 at the first data row (row 0 is the header)
minVal = data[1][1]
minIndex = 1
import csv

with open(datafile, "r") as f:
    reader = csv.reader(f)  # delimiter=',', quotechar='|' can be passed if needed
    name = next(reader)[1]  # second field of the first row
    print(name)
    next(reader)            # skip one more header line
    data = []
    for row in reader:      # read the remaining rows
        data.append(row)
with open(filename, "w", newline="") as csvfile:
    writer = csv.writer(csvfile, delimiter="|")
    for row in data:
        writer.writerow(row)
1.2. JSON format
- nested objects and arrays
- different items can have different fields
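For instance, a minimal sketch of loading such a JSON file with Python's built-in json module (the file name and the field names are just illustrative assumptions):

import json

# hypothetical file containing a list of items; different items may have different fields
with open("items.json") as f:
    items = json.load(f)

for item in items:
    # after parsing, nested objects are dicts and arrays are lists
    name = item.get("name")            # assumed to be present in every item
    tags = item.get("tags", [])        # optional array field
    address = item.get("address", {})  # optional nested object
    print(name, tags, address.get("city"))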
2. Data in More Complex Formats
2.1. XML
- designed for platform independent data transfer
- supports validation
- the domain: citation analysis, analysing citations in XML versions of scientific publications
- XML parsing with python
- handling attributes
import xml.etree.ElementTree as ET

tree = ET.parse(fname)
root = tree.getroot()
authors = []
for author in root.findall('./fm/bibl/aug/au'):
    data = {
        "fnm": author.find('fnm').text,
        "snm": author.find('snm').text,
        "email": author.find('email').text,
        "insr": [],
    }
    # handling attributes: collect the "iid" attribute of every <insr> child
    for insr in author.findall('./insr'):
        data["insr"].append(insr.attrib["iid"])
    authors.append(data)
    print(data)
2.2. Data scraping
- first we view the page source to find all the codes of the combobox lists
- then we open Firebug to see exactly which calls are made, and which parts depend on the combobox selections
- we try to identify the other parameters (e.g. something like "__VIEWSTATE", etc.)
- we write a script that makes the calls one by one and generates a set of static HTML pages (see the sketch after the snippet below)
- ! important - first store the data, then analyse it ! - best practice
- Beautiful Soup - from bs4 import BeautifulSoup - the module for extracting the values of fields from HTML, especially for this kind of task
- it turned out we also have to maintain a session
- then we write another script (or scripts) that iterates through the downloaded files, cleans up the data (= audits it), and does whatever we need to do - this is covered in several practical exercises
import requests
from bs4 import BeautifulSoup

data = {}
with open(page, "r") as html:  # page = path to a downloaded HTML file
    soup = BeautifulSoup(html, "html.parser")
    # ASP.NET form fields that have to be sent back with every POST
    data["eventvalidation"] = soup.find(id="__EVENTVALIDATION")['value']
    data["viewstate"] = soup.find(id="__VIEWSTATE")['value']

# a plain requests.post("http://..", data={...}) was not enough - the site requires a session:
s = requests.Session()
s.post("http://..", data={...})
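A rough sketch of the "call by call" download step described above, assuming a made-up URL, a made-up "airport" combobox parameter, and the viewstate/eventvalidation values extracted in the snippet above:

import requests

# hypothetical combobox codes collected from the page source
codes = ["ATL", "BOS", "SFO"]

s = requests.Session()
s.get("http://example.com/data.aspx")  # the initial GET establishes the session cookies

for code in codes:
    r = s.post("http://example.com/data.aspx",
               data={"airport": code,
                     "__VIEWSTATE": data["viewstate"],
                     "__EVENTVALIDATION": data["eventvalidation"]})
    # store the raw HTML first, analyse it later
    with open("{}.html".format(code), "w") as f:
        f.write(r.text)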
3. Data Quality
- data cleaning is an iterative process
- measures of data quality:
- validity - how well it conforms to a schema (official or unofficial)
- accuracy - "do all the street addresses exist?" - compare with some gold-standard data
- completeness - are some records missing?
- consistency - does some data contradict other parts of the data?
- uniformity - e.g. the same units everywhere
- process:
- audit the data (statistical analysis - how many values of each kind exist)
- plan how to correct it
- test if it worked
- manual correction step
- repeat until you have confidence in your data
- you can also do data enhancement besides the cleaning
3.1. Various small examples
- OpenStreetMap data example - we change St. to Street and Av. to Avenue, etc., after looking at the statistics (validity)
- DBpedia data set about cities - we find the error where density is in persons per square km but area is in square meters; we also see arrays instead of single values, and multiple timezones (uniformity)
- DBpedia data set about countries - by comparing the data to ISO country codes we find names which are not a country, arrays instead of single values, and a column shift (accuracy)
- example with video and screen capture of an exam - it is less likely that both are missing (completeness)
- example with "who do I trust the most" - e.g. there are 2 different addresses for the same person - we have to decide which one is more reliable, e.g. how was the data collected, which one is more accurate, etc. (consistency)
- example of counting the number of values of each data type, including nulls (uniformity) - see the sketch below
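A minimal sketch of such a type-counting audit for one field of a CSV file (the file and field names are assumptions):

import csv

def audit_field_types(filename, fieldname):
    # count which types the values of a field would map to, including nulls
    counts = {"null": 0, "int": 0, "float": 0, "str": 0}
    with open(filename) as f:
        for row in csv.DictReader(f):
            value = row[fieldname]
            if value is None or value in ("", "NULL"):
                counts["null"] += 1
                continue
            try:
                int(value)
                counts["int"] += 1
            except ValueError:
                try:
                    float(value)
                    counts["float"] += 1
                except ValueError:
                    counts["str"] += 1
    return counts

print(audit_field_types("cities.csv", "population"))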
3.2. Exercise
4. Working with MongoDB
- flexible schema, easy to handle hierarchical data
- JSON documents - convenient for programmers
- flexible deployment (local or cloud)
4.1. Pymongo
- the Python driver for MongoDB - it keeps the connection to the database
- minimal example
- projection
- normally _id is included by default, unless it is explicitly excluded in the projection like above
- insert
- mongoimport command-line utility - imports whole JSON files into the DB
- operators start with "$"
- $gt, $lt, $gte, $lte, $ne
- $exists
- $regex
- queries work inside arrays
- also can work against other arrays, $in and $all
- $in
- $all
- can also access hierarchy
- save(obj) method - insert or update (update if the _id exists and such an object is already in the db)
- update(obj) - for (bulk) partial updates
- update + $set
- update + $unset
- bulk update
- remove() - removes all
- remove(query)
from pymongo import MongoClient
from datetime import datetime

client = MongoClient('localhost:27017')
db = client[db_name]

# minimal example: find vs. find_one
query = {"manufacturer": "Porsche"}
autos = db.autos.find(query)
porsche = db.autos.find_one(query)

# projection: exclude _id, include only name
autos = db.autos.find(query, {"_id": 0, "name": 1})

# insert a single document (auto is a dict built elsewhere)
db.autos.insert(auto)

# mongoimport command line utility:
#   mongoimport -d examplesdb -c autos --file autos.json

# operators
query = {"foundingDate": {"$gte": datetime(2001, 1, 1), "$lte": datetime(2099, 12, 31)}}
query = {"governmentType": {"$exists": 1}}
query = {"assembly": {"$in": ["Germany", "United Kingdom", "Japan"]}, "manufacturer": "Ford Motor Company"}
query = {"modelYears": {"$all": [1965, 1966, 1967]}}
query = {"dimensions.width": {"$gt": 2.5}}  # querying into the hierarchy

# update + $set
city = db.cities.update({"name": "Munich", "country": "Germany"},
                        {"$set": {"isoCountryCode": "DEU"}})

# update + $unset (the value given to $unset is ignored)
city = db.cities.update({"name": "Munich", "country": "Germany"},
                        {"$unset": {"isoCountryCode": "blahblah_this is ignored"}})

# bulk update: multi=True applies the change to all matching documents
city = db.cities.update({"name": "Munich", "country": "Germany"},
                        {"$set": {"isoCountryCode": "DEU"}},
                        multi=True)

# remove() without a query removes all documents
db.cities.remove()
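The $regex operator from the list above is not shown in the snippets; a minimal sketch reusing the db handle from above (the field and pattern are just examples):

# cities whose motto contains the word "friendship", case-insensitive
query = {"motto": {"$regex": "friendship", "$options": "i"}}
cities = db.cities.find(query)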
4.2. Exercise
A lot of specific data cleaning and modifying while copying the data from CSV to MongoDB, row by row. It’s about an arachnid (spiders) data set.
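Roughly, it boils down to something like this sketch (the file name, field handling, and cleaning steps are illustrative, not the actual task specification):

import csv
from pymongo import MongoClient

client = MongoClient("localhost:27017")
db = client.examples

with open("arachnid.csv") as f:
    for row in csv.DictReader(f):
        # illustrative cleaning: strip whitespace, turn empty strings into None
        entry = {field: (value.strip() or None) for field, value in row.items()}
        db.arachnid.insert(entry)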
5. Analysing Data
- twitter data set
- followers, followees, tweets, tweet contents
- user id is called "screen_name"
- tasks like "find who tweeted the most"
5.1. Aggregation framework in MongoDB
- $group and $sort
- $skip and $limit - skip the first few documents, or limit the output to a given number of documents
- $match - for filtering, and $project
- $unwind - will multiply the record for each value in an array
- $sum, $first, $last, $max, $min, $avg
- $push and $addToSet
- indices - they are hierarchical; the hierarchy determines in which order items can be searched - think about it
- geospatial indices - don't search by exact point, but "near" to points ($near)
db = get_db('twitter')

# tweets per source, sorted by count descending
pipeline = [
    {"$group": {"_id": "$source",
                "count": {"$sum": 1}}},
    {"$sort": {"count": -1}}]
result = db.tweets.aggregate(pipeline)

# user in the Brasilia time zone with at least 100 tweets and the most followers
pipeline = [{"$match": {"user.time_zone": "Brasilia"}},
            {"$match": {"user.statuses_count": {"$gte": 100}}},
            {"$project": {"followers": "$user.followers_count",
                          "screen_name": "$user.screen_name",
                          "tweets": "$user.statuses_count"}},
            {"$sort": {"followers": -1}},
            {"$limit": 1}]

# $unwind: region of India (isPartOf) with the most cities
pipeline = [{"$match": {"country": "India"}},
            {"$unwind": "$isPartOf"},
            {"$group": {"_id": "$isPartOf", "count": {"$sum": 1}}},
            {"$sort": {"count": -1}},
            {"$limit": 1}]

# average city population per region, then the overall average of those averages
pipeline = [{"$match": {"country": "India"}},
            {"$unwind": "$isPartOf"},
            {"$group": {"_id": "$isPartOf",
                        "avg": {"$avg": "$population"}}},
            {"$group": {"_id": "totalAvg",
                        "avg": {"$avg": "$avg"}}}]

# $push: top 5 users by tweet count, collecting their tweet texts
pipeline = [{"$group": {"_id": "$user.screen_name",
                        "tweet_texts": {"$push": "$text"},
                        "count": {"$sum": 1}}},
            {"$sort": {"count": -1}},
            {"$limit": 5}]

# creating indices (mongo shell syntax)
db.autos.ensureIndex({"name": 1})
db.autos.ensureIndex(loc_field, direction)
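For the geospatial point above, a rough sketch of creating a 2d index and querying with $near, assuming a "loc" field holding [longitude, latitude] pairs and the course-era pymongo API:

from pymongo import GEO2D

# index the legacy coordinate pairs - note that longitude comes first here
db.cities.ensure_index([("loc", GEO2D)])
# documents closest to the given point
nearby = db.cities.find({"loc": {"$near": [11.57, 48.13]}}).limit(10)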
5.2. Exercises
Just building different pipelines.
6. Case study
It’s about the OpenStreetMap data set, which can be downloaded as XML from their site. You can download the part of the map you are looking at, or download data for major cities. They also have a very nice wiki.
The data is XML with “node”s and “way”s (a way is a street, road, etc.). The data is human-edited, so it contains errors.
6.1. SAX XML parsing
import xml.etree.ElementTree as ET

# iterative (SAX-style) parsing - the whole file is not loaded into memory at once
for event, elem in ET.iterparse(xml_filename, events=("start",)):
    handle_node(elem)
The non-iterative parsing (reading everything into memory at once) would look like this:
tree = ET.parse(xml_filename)
root = tree.getroot()
for child in root:
    handle_node(child)
- node.tag - name of XML node
- node.attrib - dictionary of node attributes
- node.attrib.keys() - the attribute names (see the sketch below)
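A small sketch combining iterparse with these attributes, e.g. counting the element names in an OSM extract (the file name is an assumption):

import xml.etree.ElementTree as ET

tag_counts = {}
for _, elem in ET.iterparse("map.osm"):
    # elem.tag is the element name, e.g. "node" or "way";
    # elem.attrib holds its attributes, e.g. "id", "lat", "lon"
    tag_counts[elem.tag] = tag_counts.get(elem.tag, 0) + 1
print(tag_counts)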
6.2 Regular expressions
import re

# matches strings that consist only of lowercase letters and underscores
lower = re.compile(r'^([a-z]|_)*$')
re.findall(lower, string)  # all non-overlapping matches
m = lower.search(string)   # first match, or None if there is none
if m:
    substring = m.group()
6.3 Exercise
It’s about parsing an XML document, iterating over the XML nodes, and creating proper Python dictionaries (the exact shape is specified in the task description).
Conclusions
- Reading the actual data helps a lot: it makes you spot misconceptions and understand what is actually going on
- Read the task carefully and do not switch off thinking ;)
- Latitude comes before longitude