These are the notes I took while taking the “Data Wrangling with MongoDB” course at Udacity. They cover how to use Python to process CSV, XML, and Excel files, how to work with MongoDB, and include some examples of page scraping in Python.
Table of contents
- 1. Data Extraction Fundamentals
- 2. Data in More Complex Formats
- 3. Data Quality
- 4. Working with MongoDB
- 5. Analysing Data
- 6. Case study
1. Data Extraction Fundamentals
- data wrangling = gathering, extracting, cleaning and storing data
- you have to ensure your data is correct before doing anything else with it, especially if a human was involved in entering the data
1.1 Tabular data and CSV
- items (rows) have fields (columns), and the first row contains the labels
- parsing CSV in Python - manually
- parsing XLS in Python - with the xlrd module
- parsing CSV in Python - with the csv module
- writing CSV in Python
def parse_csv_manually(datafile):
    # manual parsing: split the header line and each data line on commas;
    # create_data is a helper from the exercise that builds a dict from titles and values
    data = []
    with open(datafile, "r") as f:
        titles = f.readline().strip().split(",")
        for i, line in enumerate(f):
            if i >= 10:  # only look at the first few data rows
                break
            data.append(create_data(titles, line.strip().split(",")))
    return data
import xlrd

workbook = xlrd.open_workbook(datafile)
sheet = workbook.sheet_by_index(0)
# read the first two columns of every row into a list of lists
data = [[sheet.cell_value(r, col) for col in range(2)] for r in range(sheet.nrows)]
# start the minimum search in column 1 at the first data row (row 0 is the header)
minVal = data[1][1]
minIndex = 1
import csv

with open(datafile, "r") as f:
    reader = csv.reader(f)  # delimiter=',', quotechar='|' can be passed if needed
    name = next(reader)[1]  # second field of the first row
    print(name)
    next(reader)            # skip one more header line
    data = []
    for row in reader:      # read the remaining rows
        data.append(row)
with open(filename, "w", newline="") as csvfile:
    writer = csv.writer(csvfile, delimiter="|")
    for row in data:
        writer.writerow(row)
1.2. JSON format
- nested objects and arrays
- different items can have different fields
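For instance, a minimal sketch of loading such a JSON file with Python's built-in json module (the file name and the field names are just illustrative assumptions):

import json

# hypothetical file containing a list of items; different items may have different fields
with open("items.json") as f:
    items = json.load(f)

for item in items:
    # after parsing, nested objects are dicts and arrays are lists
    name = item.get("name")            # assumed to be present in every item
    tags = item.get("tags", [])        # optional array field
    address = item.get("address", {})  # optional nested object
    print(name, tags, address.get("city"))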
2. Data in More Complex Formats
2.1. XML
- designed for platform independent data transfer
- supports validation
- the domain: citation analysis, analysing citations in XML versions of scientific publications
- XML parsing with python
- handling attributes
import xml.etree.ElementTree as ET

tree = ET.parse(fname)
root = tree.getroot()
authors = []
for author in root.findall('./fm/bibl/aug/au'):
    data = {
        "fnm": author.find('fnm').text,
        "snm": author.find('snm').text,
        "email": author.find('email').text,
        "insr": [],
    }
    # handling attributes: collect the "iid" attribute of every <insr> child
    for insr in author.findall('./insr'):
        data["insr"].append(insr.attrib["iid"])
    authors.append(data)
    print(data)
2.2. Data scraping
- first we view the page source to find all the codes of the combobox lists
- then we open Firebug to see exactly which calls are made, and which parts depend on the combobox selections
- we try to identify the other parameters (e.g. something like "__VIEWSTATE", etc.)
- we write a script that makes the calls one by one and generates a set of static HTML pages (see the sketch after the snippet below)
- ! important - first store the data, then analyse it ! - best practice
- Beautiful Soup - from bs4 import BeautifulSoup - the module for extracting the values of fields from HTML, especially for this kind of task
- it turned out we also have to maintain a session
- then we write another script (or scripts) that iterates through the downloaded files, cleans up the data (= audits it), and does whatever we need to do - this is covered in several practical exercises
import requests
from bs4 import BeautifulSoup

data = {}
with open(page, "r") as html:  # page = path to a downloaded HTML file
    soup = BeautifulSoup(html, "html.parser")
    # ASP.NET form fields that have to be sent back with every POST
    data["eventvalidation"] = soup.find(id="__EVENTVALIDATION")['value']
    data["viewstate"] = soup.find(id="__VIEWSTATE")['value']

# a plain requests.post("http://..", data={...}) was not enough - the site requires a session:
s = requests.Session()
s.post("http://..", data={...})
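A rough sketch of the "call by call" download step described above, assuming a made-up URL, a made-up "airport" combobox parameter, and the viewstate/eventvalidation values extracted in the snippet above:

import requests

# hypothetical combobox codes collected from the page source
codes = ["ATL", "BOS", "SFO"]

s = requests.Session()
s.get("http://example.com/data.aspx")  # the initial GET establishes the session cookies

for code in codes:
    r = s.post("http://example.com/data.aspx",
               data={"airport": code,
                     "__VIEWSTATE": data["viewstate"],
                     "__EVENTVALIDATION": data["eventvalidation"]})
    # store the raw HTML first, analyse it later
    with open("{}.html".format(code), "w") as f:
        f.write(r.text)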
3. Data Quality
- data cleaning is an iterative process
- measures of data quality:
- validity - how well it conforms to a schema (official or unofficial)
- accuracy - "do all the street addresses exist?" - compare with some gold-standard data
- completeness - are some records missing?
- consistency - does some data contradict other parts of the data?
- uniformity - e.g. the same units everywhere
- process:
- audit the data (statistical analysis - how many values of each kind exist)
- plan how to correct it
- test if it worked
- manual correction step
- repeat until you have confidence in your data
- you can also do data enhancement besides the cleaning
3.1. Various small examples
- OpenStreetMap data example - we change St. to Street and Av. to Avenue, etc., after looking at the statistics (validity)
- DBpedia data set about cities - we find the error where density is in persons per square km but area is in square meters; we also see arrays instead of single values, and multiple timezones (uniformity)
- DBpedia data set about countries - by comparing the data to ISO country codes we find names which are not a country, arrays instead of single values, and a column shift (accuracy)
- example with video and screen capture of an exam - it is less likely that both are missing (completeness)
- example with "who do I trust the most" - e.g. there are 2 different addresses for the same person - we have to decide which one is more reliable, e.g. how was the data collected, which one is more accurate, etc. (consistency)
- example of counting the number of values of each data type, including nulls (uniformity) - see the sketch below
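A minimal sketch of such a type-counting audit for one field of a CSV file (the file and field names are assumptions):

import csv

def audit_field_types(filename, fieldname):
    # count which types the values of a field would map to, including nulls
    counts = {"null": 0, "int": 0, "float": 0, "str": 0}
    with open(filename) as f:
        for row in csv.DictReader(f):
            value = row[fieldname]
            if value is None or value in ("", "NULL"):
                counts["null"] += 1
                continue
            try:
                int(value)
                counts["int"] += 1
            except ValueError:
                try:
                    float(value)
                    counts["float"] += 1
                except ValueError:
                    counts["str"] += 1
    return counts

print(audit_field_types("cities.csv", "population"))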
3.2. Exercise
4. Working with MongoDB
- flexible schema, easy to handle hierarchical data
- JSON documents - convenient for programmers
- flexible deployment (local or cloud)
4.1. Pymongo
- the Python driver for MongoDB - it keeps the connection to the database
- minimal example
- projection
- normally _id is included by default, unless it is explicitly excluded in the projection like above
- insert
- mongoimport command-line utility - imports whole JSON files into the DB
- operators start with "$"
- $gt, $lt, $gte, $lte, $ne
- $exists
- $regex
- queries work inside arrays
- also can work against other arrays, $in and $all
- $in
- $all
- can also access hierarchy
- save(obj) method - insert or update (update if the _id exists and such an object is already in the db)
- update(obj) - for (bulk) partial updates
- update + $set
- update + $unset
- bulk update
- remove() - removes all
- remove(query)
from pymongo import MongoClient
from datetime import datetime

client = MongoClient('localhost:27017')
db = client[db_name]

# minimal example: find vs. find_one
query = {"manufacturer": "Porsche"}
autos = db.autos.find(query)
porsche = db.autos.find_one(query)

# projection: exclude _id, include only name
autos = db.autos.find(query, {"_id": 0, "name": 1})

# insert a single document (auto is a dict built elsewhere)
db.autos.insert(auto)

# mongoimport command line utility:
#   mongoimport -d examplesdb -c autos --file autos.json

# operators
query = {"foundingDate": {"$gte": datetime(2001, 1, 1), "$lte": datetime(2099, 12, 31)}}
query = {"governmentType": {"$exists": 1}}
query = {"assembly": {"$in": ["Germany", "United Kingdom", "Japan"]}, "manufacturer": "Ford Motor Company"}
query = {"modelYears": {"$all": [1965, 1966, 1967]}}
query = {"dimensions.width": {"$gt": 2.5}}  # querying into the hierarchy

# update + $set
city = db.cities.update({"name": "Munich", "country": "Germany"},
                        {"$set": {"isoCountryCode": "DEU"}})

# update + $unset (the value given to $unset is ignored)
city = db.cities.update({"name": "Munich", "country": "Germany"},
                        {"$unset": {"isoCountryCode": "blahblah_this is ignored"}})

# bulk update: multi=True applies the change to all matching documents
city = db.cities.update({"name": "Munich", "country": "Germany"},
                        {"$set": {"isoCountryCode": "DEU"}},
                        multi=True)

# remove() without a query removes all documents
db.cities.remove()
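The $regex operator from the list above is not shown in the snippets; a minimal sketch reusing the db handle from above (the field and pattern are just examples):

# cities whose motto contains the word "friendship", case-insensitive
query = {"motto": {"$regex": "friendship", "$options": "i"}}
cities = db.cities.find(query)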
4.2. Exercise
A lot of specific data cleaning and modifying while copying the data from CSV to MongoDB, row by row. It’s about an arachnid (spiders) data set.
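Roughly, it boils down to something like this sketch (the file name, field handling, and cleaning steps are illustrative, not the actual task specification):

import csv
from pymongo import MongoClient

client = MongoClient("localhost:27017")
db = client.examples

with open("arachnid.csv") as f:
    for row in csv.DictReader(f):
        # illustrative cleaning: strip whitespace, turn empty strings into None
        entry = {field: (value.strip() or None) for field, value in row.items()}
        db.arachnid.insert(entry)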
5. Analysing Data
- twitter data set
- followers, followees, tweets, tweet contents
- user id is called "screen_name"
- tasks like "find who tweeted the most"
5.1. Aggregation framework in MongoDB
- $group and $sort
- $skip and $limit - skip the first few documents, or limit the output to a given number of documents
- $match - for filtering, and $project
- $unwind - will multiply the record for each value in an array
- $sum, $first, $last, $max, $min, $avg
- $push and $addToSet
- indices - they are hierarchical; the hierarchy determines in which order items can be searched - think about it
- geospatial indices - don't search by exact point, but "near" to points ($near)
db = get_db('twitter')

# tweets per source, sorted by count descending
pipeline = [
    {"$group": {"_id": "$source",
                "count": {"$sum": 1}}},
    {"$sort": {"count": -1}}]
result = db.tweets.aggregate(pipeline)

# user in the Brasilia time zone with at least 100 tweets and the most followers
pipeline = [{"$match": {"user.time_zone": "Brasilia"}},
            {"$match": {"user.statuses_count": {"$gte": 100}}},
            {"$project": {"followers": "$user.followers_count",
                          "screen_name": "$user.screen_name",
                          "tweets": "$user.statuses_count"}},
            {"$sort": {"followers": -1}},
            {"$limit": 1}]

# $unwind: region of India (isPartOf) with the most cities
pipeline = [{"$match": {"country": "India"}},
            {"$unwind": "$isPartOf"},
            {"$group": {"_id": "$isPartOf", "count": {"$sum": 1}}},
            {"$sort": {"count": -1}},
            {"$limit": 1}]

# average city population per region, then the overall average of those averages
pipeline = [{"$match": {"country": "India"}},
            {"$unwind": "$isPartOf"},
            {"$group": {"_id": "$isPartOf",
                        "avg": {"$avg": "$population"}}},
            {"$group": {"_id": "totalAvg",
                        "avg": {"$avg": "$avg"}}}]

# $push: top 5 users by tweet count, collecting their tweet texts
pipeline = [{"$group": {"_id": "$user.screen_name",
                        "tweet_texts": {"$push": "$text"},
                        "count": {"$sum": 1}}},
            {"$sort": {"count": -1}},
            {"$limit": 5}]

# creating indices (mongo shell syntax)
db.autos.ensureIndex({"name": 1})
db.autos.ensureIndex(loc_field, direction)
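For the geospatial point above, a rough sketch of creating a 2d index and querying with $near, assuming a "loc" field holding [longitude, latitude] pairs and the course-era pymongo API:

from pymongo import GEO2D

# index the legacy coordinate pairs - note that longitude comes first here
db.cities.ensure_index([("loc", GEO2D)])
# documents closest to the given point
nearby = db.cities.find({"loc": {"$near": [11.57, 48.13]}}).limit(10)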
5.2. Exercises
Just building different pipelines.
6. Case study
It’s about the OpenStreetMap data set, which can be downloaded as XML from their site. You can download the part of the map you are looking at, or download data for major cities. They also have a very nice wiki.
The data is XML with “node”s and “way”s (a way is a street, road, etc.). The data is human-edited, so it contains errors.
6.1. SAX XML parsing
import xml.etree.ElementTree as ET

# iterative (SAX-style) parsing - the whole file is not loaded into memory at once
for event, elem in ET.iterparse(xml_filename, events=("start",)):
    handle_node(elem)
The non-iterative parsing (reading everything into memory at once) would look like this:
tree = ET.parse(xml_filename)
root = tree.getroot()
for child in root:
    handle_node(child)
- node.tag - name of XML node
- node.attrib - dictionary of node attributes
- node.attrib.keys() - the attribute names (see the sketch below)
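A small sketch combining iterparse with these attributes, e.g. counting the element names in an OSM extract (the file name is an assumption):

import xml.etree.ElementTree as ET

tag_counts = {}
for _, elem in ET.iterparse("map.osm"):
    # elem.tag is the element name, e.g. "node" or "way";
    # elem.attrib holds its attributes, e.g. "id", "lat", "lon"
    tag_counts[elem.tag] = tag_counts.get(elem.tag, 0) + 1
print(tag_counts)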
6.2 Regular expressions
import re

# matches strings that consist only of lowercase letters and underscores
lower = re.compile(r'^([a-z]|_)*$')
re.findall(lower, string)  # all non-overlapping matches
m = lower.search(string)   # first match, or None if there is none
if m:
    substring = m.group()
6.3 Exercise
It’s about parsing an XML document, iterating over the XML nodes, and creating proper Python dictionaries (the exact shape is specified in the task description).
Conclusions
- Reading the actual data helps a lot: it makes you spot misconceptions and understand what is actually going on
- Read the task carefully and do not switch off thinking ;)
- Latitude comes before longitude