Scraping OSM with Python 
The OpenStreetMap API lets you do lots of things with the OSM data, like uploading and downloading GPX traces. Unfortunately, when you download GPX traces through the API the date/time stamps have been removed. If you download GPS traces individually from the public GPS list, then you get the raw original (I think) GPS data with date/time stamps.

Now that Birmingham is complete, I wanted to generate the party render for the entire city over the last 2 years. I needed a way to download 170+ traces, and I certainly wasn't doing this by hand. What follows is my most hacky Python script yet: it parses a page, looks for the GPS trace IDs, then constructs a URL to download each one. You have to change the page to scrape manually, but hey, I wrote it in 30 mins, what do you expect?


#! /usr/bin/env python

'''Python GPS downloader for OSM'''

import sgmllib

class MyParser(sgmllib.SGMLParser):
    "A simple parser class."

    def parse(self, s):
        "Parse the given string 's'."
        self.feed(s)
        self.close()

    def __init__(self, verbose=0):
        "Initialise an object, passing 'verbose' to the superclass."

        sgmllib.SGMLParser.__init__(self, verbose)
        self.hyperlinks = []

    def start_a(self, attributes):
        "Process a hyperlink and its 'attributes'."

        for name, value in attributes:
            if name == "href":
                self.hyperlinks.append(value)

    def get_hyperlinks(self):
        "Return the list of hyperlinks."

        return self.hyperlinks

import urllib, sgmllib

# Get something to work with.
f = urllib.urlopen("http://www.openstreetmap.org/traces/tag/Birmingham/page/9")
s = f.read()

# Try and process the page.
# The class should have been defined first, remember.
myparser = MyParser()
myparser.parse(s)

# Get the hyperlinks.
links = myparser.get_hyperlinks()
working = []
final = []
traces = []
nonduplicates = []

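# Keep only the links that mention a user...
for i in links:
    b = i.find("user")
    print i, b
    if b > 0:
        working.append(i)

# ...and of those, only the ones that point at a trace.
for i in working:
    b = i.find("traces")
    print i, b
    if b > 0:
        final.append(i)

# The trace ID is the last part of the link.
for i in final:
    parts = i.split("/")
    lastitem = len(parts)-1
    traces.append(parts[lastitem])

# Each trace is linked more than once on the page, so drop the repeats.
for i in traces:
    if i not in nonduplicates:
        nonduplicates.append(i)

# Download every trace into a local file named after its ID.
for i in nonduplicates:
    url = "http://www.openstreetmap.org/trace/" + i + "/data"
    print url
    trace = urllib.urlopen(url)
    localfile = open(i, "w")
    localfile.write(trace.read())
    trace.close()
    localfile.close()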



Credit - the HTML parsing class was written by Boddie.

I plan on tidying this up (a lot), as it is extremely useful, until the OSM API catches up anyway.
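For anyone on Python 3, where sgmllib no longer exists, a tidier version of the link-filtering part might look something like the sketch below. It is only a sketch: the names LinkParser, trace_ids and download_url are made up for illustration, and the regular expression assumes trace links look like /user/<name>/traces/<numeric id>, which is a guess based on what the script above filters for.

```python
import re
from html.parser import HTMLParser


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hyperlinks = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hyperlinks.append(value)


# Assumed link shape (not confirmed by the post): /user/<name>/traces/<id>
TRACE_LINK = re.compile(r"/user/[^/]+/traces/(\d+)$")


def trace_ids(html):
    """Return the unique trace IDs found in 'html', in page order."""
    parser = LinkParser()
    parser.feed(html)
    parser.close()
    ids = []
    for link in parser.hyperlinks:
        match = TRACE_LINK.search(link)
        if match and match.group(1) not in ids:
            ids.append(match.group(1))
    return ids


def download_url(trace_id):
    """Build the download URL the same way the script above does."""
    return "http://www.openstreetmap.org/trace/" + trace_id + "/data"
```

Feeding trace_ids() the text of a listing page and passing each ID to download_url() gives the same list of URLs the loops above build, without the intermediate lists.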



Birmingham is Complete! 
In November the mappers of Birmingham agreed to complete everything inside the motorways before Christmas. It was a mammoth task, but OpenStreetMap now has a complete map for the Birmingham area! This is thanks to the hard work of multiple mappers in the Birmingham area, but most of all to Andy Robinson, BrianBoru and Xoff. Between them, these 3 guys mapped more than 75% of Birmingham. My own efforts only managed to contribute about 1.3%!

The announcement by Andy can be found here.

This is not to say there is nothing left to do, as there are plenty of things to add to the map. I for one am going to keep adding data.

The next step is to get this data into the hands of people who can use it. Please send in suggestions, or edit the Mappa Mercia page on the OSM wiki to include anything you think is useful.

Directoryhash 1.4 
Noticed a few problems with the script, firstly the fact that I hard-coded the maximum file size into the script. It's now an option on the command line.

Also added some extra file detection stuff. If the file is zero bytes big, it now doesn't bother adding it to the hash list, but it is recorded in the output file. Same for files bigger than the maximum.
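The size check boils down to sorting each file into one of three buckets. Here is a minimal Python 3 sketch of that decision on its own; the function name classify_by_size and the bucket labels are made up for illustration, and it assumes a file exactly at the maximum size should still be hashed.

```python
def classify_by_size(size, max_size):
    """Return 'empty', 'too_big' or 'hash' for a file of 'size' bytes."""
    if size == 0:
        # Zero-byte files are recorded but never hashed.
        return "empty"
    if size > max_size:
        # Files over the command line maximum are recorded but never hashed.
        return "too_big"
    # Everything from 1 byte up to and including max_size gets hashed.
    return "hash"
```

Calling it with os.path.getsize(filepath) and the maximum from the command line keeps the boundary cases in one place.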


#! /usr/bin/env python

''' A python program that walks a given directory to find files that are
duplicated. It then outputs the results to console (simply printing a
dictionary), and an output file.

command line parameters

./directoryhash_1.4.py [root directory] [outputfile] [max file size in bytes]
'''

import os
import sys
import md5

hashes = {}     # The "working" hashes dictionary
final = {}      # The final dictionary with all the duplicated files,
                # with their hashes as keys.
zerobytes = []  # A list of files with zero bytes.
toobig = []     # Files that were too big.

rootpath = sys.argv[1]
outputfile = open(sys.argv[2], "w")
maxfile = sys.argv[3]
maxfile = long(maxfile)

def hashfunction(filetohash):
    ''' Takes a filetohash, hashes it with md5, then checks to see if
    that hash already exists. If not, it adds it to a dictionary of files, where
    the hash is the key. If it does, the file is added to the existing entry.
    '''
    try:
        openedfile = open(filetohash, "rb")
        # print openedfile
        filehash = md5.new(openedfile.read()).hexdigest()
        # print filehash
        if filehash not in hashes:
            hashes[filehash] = [filetohash]
        else:
            hashes[filehash].append(filetohash)

    except IOError:
        print "\n"
        print filetohash
        print "Probably a directory. Ignoring"

# The following section walks the directory from the rootpath.
# It then calls hashfunction() to do the checking etc.

for dirpath, directories, files in os.walk(rootpath):
    for i in files:
        filepath = os.path.join(dirpath, i)
        print filepath
        try:
            filesize = os.path.getsize(filepath)
            if filesize > maxfile:
                print filepath + "\n" + "Too big!"
                toobig.append(filepath)
            elif filesize > 0:
                # Anything from 1 byte up to maxfile gets hashed.
                hashfunction(filepath)
            else:
                zerobytes.append(filepath)

        except OSError:
            # Handles errors with the filenames, usually seems to be because
            # of file locking etc. Not sure. Don't care.
            print "BORK!"


# Checks the dictionary of hashes and discards all entries where
# there is only one file per hash. (ie the file is unique)

for j in hashes:
    if len(hashes[j]) >= 2:
        final[j] = hashes[j]



print "\n"

# Takes the final dictionary, and writes the output to a text
# file so it's useful.

if len(final) > 0:
    print "Duplicates found \nCheck output file \n" + "-" * 20
    for l in final:
        outputfile.write("hash: " + l + "\n")
        for i in final[l]:
            outputfile.write(i + "\n")

        outputfile.write("-" * 20 + "\n\n")
else:
    print "No duplicates found! \n" + "-" * 20
    outputfile.write("No Duplicates found!\n\n")

if len(zerobytes) > 0:
    outputfile.write("Empty files \n" + "-" * 20 + "\n")
    for m in zerobytes:
        outputfile.write(m + "\n")
    outputfile.write("-" * 20 + "\n\n")

if len(toobig) > 0:
    outputfile.write("Files bigger than " + str(maxfile) + " bytes" + "\n" + "-" * 20 + "\n")
    for m in toobig:
        outputfile.write(m + "\n")
    outputfile.write("-" * 20 + "\n\n")

outputfile.close()


Enjoy!


Taking hash(es) for good causes 
I am now at a point where I can start writing useful Python scripts, but I find it hard to come up with things for the scripts to do.

After speaking with Martin Hellwig and Alex Wilmer after a Python WM meeting, I found out there is an interesting set of applications that tell you if you have duplicated files on your computer. They are nothing exceptional: just a program that indexes your file system, hashes all the files, then compares the hashes. Companies can charge quite a lot for these applications, but I thought "I reckon I know enough Python to do that myself!"

So here is the result of that.


#! /usr/bin/env python

''' A python program that walks a given directory to find files that are
duplicated. It then outputs the results to console (simply printing a
dictionary), and an output file.

command line parameters ./directoryhash_1.3.py [root directory] [outputfile]
'''

import os
import sys
import md5

hashes = {} # The "working" hashes dictionary
final = {} # The final dictionary with the all the duplicated files,
# with their hashes as keys

rootpath = sys.argv[1]
outputfile = open(sys.argv[2], "w")

def hashfunction(filetohash):
    ''' Takes a filetohash, hashes it with md5, then checks to see if
    that hash already exists. If not, it adds it to a dictionary of files, where
    the hash is the key. If it does, the file is added to the existing entry.
    '''
    try:
        openedfile = open(filetohash, "rb")
        # print openedfile
        filehash = md5.new(openedfile.read()).hexdigest()
        # print filehash
        if filehash not in hashes:
            hashes[filehash] = [filetohash]
        else:
            hashes[filehash].append(filetohash)

    except IOError:
        print "\n"
        print filetohash
        print "Probably a directory. Ignoring"

# The following section walks the directory from the rootpath.
# It then calls hashfunction() to do the checking etc.

for dirpath, directories, files in os.walk(rootpath):
    for i in files:
        filepath = dirpath + "/" + i
        print filepath
        try:
            # 157286400 bytes = 150 MB, the hard-coded maximum file size.
            if os.path.getsize(filepath) < 157286400:
                hashfunction(filepath)
            else:
                print filepath + "\n" + "Too big!"
                continue
        except OSError:
            # Handles errors with the filenames, usually seems to be because
            # of file locking etc. Not sure. Don't care.

            print "BORK!"


# Checks the dictionary of hashes and discards all entries where
# there is only one file per hash. (ie the file is unique)

for j in hashes:
    if len(hashes[j]) >= 2:
        final[j] = hashes[j]


print 20 * "-"
print "All Files and Hashes"
print hashes

print "\n"

print 20 * "-"
print "Duplicated Files"
for k in final:
    print final[k]

print "\n"

# Takes the final dictionary, and writes the output to a text
# file so it's useful.

for l in final:
    outputfile.write("hash: " + l + "\n")
    for i in final[l]:
        outputfile.write(i + "\n")

    outputfile.write("-" * 20 + "\n\n")

outputfile.close()



It's crude looking and could probably do with some clean-up and extra error catching, but it works. It outputs a text file with all the duplicate files that it found. Simple.

I find that the "openedfile = open(filetohash, "rb")" bit has the annoying habit of printing out what it has opened, and I am unsure of how to change this. Any suggestions welcome.

It was suggested that I use the OpenSSL Python library to do the hashing, but I couldn't get my head around it quickly enough, so I bottled it and went with the standard library's md5 module.
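For comparison, here is a rough Python 3 sketch of the same idea using hashlib from the standard library, which replaced the old md5 module. It reads each file in chunks rather than all at once, so big files don't have to fit in memory. The names file_md5 and find_duplicates and the 1 MB chunk size are assumptions for illustration, not part of the script above.

```python
import hashlib
import os
from collections import defaultdict


def file_md5(path, chunk_size=1024 * 1024):
    """Return the MD5 hex digest of the file at 'path', read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(rootpath):
    """Map each hash to the sorted list of files sharing it (2 or more)."""
    hashes = defaultdict(list)
    for dirpath, _directories, files in os.walk(rootpath):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                hashes[file_md5(path)].append(path)
            except OSError:
                # Unreadable or locked file: skip it, as the script above does.
                continue
    return {h: sorted(paths) for h, paths in hashes.items() if len(paths) >= 2}
```

The returned dictionary plays the same role as "final" in the script: hashes as keys, lists of duplicated files as values.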



Making Movies Maw! 
Stop frame animation - Linux style

As part of my masters project I will need a way of constructing movies from single images.

I have been having some issues with ffmpeg, and totem. Totem has been screwing up files somehow....

Here is the command I am using.

ffmpeg -f image2 -r 10 -qscale 1 -i %03d.jpg i.avi

-r = frame rate

-qscale = VBR encoding quality (1-31)

This gives you a nice little movie. Play with the -qscale option to get the output quality you want.
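If you want to drive this from a script, one option is to build the argument list in Python and hand it to subprocess. The sketch below only builds the list. The function name build_ffmpeg_command is made up for illustration, the defaults are simply the values from the command above, and the only flags used are the ones already shown.

```python
def build_ffmpeg_command(pattern="%03d.jpg", output="i.avi", fps=10, qscale=1):
    """Return the ffmpeg argument list for turning numbered images into a movie."""
    if not 1 <= qscale <= 31:
        # The quality scale runs from 1 to 31, as noted above.
        raise ValueError("qscale must be between 1 and 31")
    return ["ffmpeg", "-f", "image2", "-r", str(fps),
            "-qscale", str(qscale), "-i", pattern, output]
```

Passing the result to subprocess.call() would then run the same command as the one typed by hand above.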

This has taken me a week to do, thanks Totem and your bizarre treatment of media files.


