Simple MP3 file downloader

Prerequisites

Sublime Text, Python, requests, RSS parser

First off, I had to learn how to build (run) Python modules in Sublime Text. Guides are available on both official and unofficial web sites. With a good understanding of these guides, I can write a build script that fits my setup.

That being said, in practice I searched Google for examples and found the following build configuration.

{
    "cmd": ["<<<your path to python file>>>\\python.exe", "-u", "$file"],
    "file_regex": "^[ ]*File \"(...*?)\", line ([0-9]*)",
    "selector": "source.python"
}

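NOTE: $file is a variable provided by the Sublime Text build system. It refers to the file in the active view. Aside from that, I did not understand at first why the -u option is needed to run a Python module; it makes Python's stdout and stderr unbuffered, so output shows up in the build panel as it is produced.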

While doing a dry run of the requests library with the following code, I got a UnicodeEncodeError. Until that point, I had misunderstood Python's text model. I assumed the problem was related to decoding the characters. However, the characters were already stored in a string object, r.text, which means they were already a Unicode string. As it turns out, when I print the string, the output codec, which is 'cp949' on my Windows console, tries to encode the string into that character set. The error below occurred during that encoding step. For more information about codecs, visit the official Python site.

# -*- coding: utf-8 -*-
import requests
rss = 'http://feeds.bbci.co.uk/learningenglish/english/features/6-minute-english/rss'
r = requests.get(rss)
print(r.text)

UnicodeEncodeError: 'cp949' codec can't encode character '\xe2' in position 2347: illegal multibyte sequence

NOTE: A short summary of Python 3's text model

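An object of the 'str' class is Unicode text.
A 'str' object can be encoded into 'bytes' with any character encoding that can represent its characters.
When you receive bytes from the outside world, such as a web server, you have to know their encoding to convert them to a 'str' object.
Unicode sandwich: decode bytes to 'str' on input, work with 'str' inside the program, and encode back to bytes only on output.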
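
The summary above can be illustrated with a small sketch. The variable names and the sample text are made up for illustration; only the built-in str and bytes methods are used.

```python
# Bytes as they might arrive from a web server (here: UTF-8 encoded).
raw = "caf\u00e2".encode("utf-8")

# Bottom slice of the sandwich: decode incoming bytes once, at the boundary.
text = raw.decode("utf-8")

# Filling: work with str only.
shouted = text.upper()

# Top slice: encode on the way out. UTF-8 can represent every character.
utf8_out = shouted.encode("utf-8")

# A legacy codec such as cp949 cannot represent U+00E2 and raises instead.
try:
    text.encode("cp949")
    cp949_failed = False
except UnicodeEncodeError:
    cp949_failed = True

# errors="replace" substitutes "?" for unencodable characters instead of raising.
lossy = text.encode("cp949", errors="replace")
```

The failing encode call is the same step that print() performs implicitly when the console codec is cp949.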

Fortunately, we don't have to print every line of the received RSS file. And if I did have to print them all, a 'try ... except' block could be used to handle the error.
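
One possible shape for such a 'try ... except' fallback is sketched below. The helper name safe_print is hypothetical, and the in-memory cp949 stream only stands in for a Windows console so the behaviour can be seen on any machine.

```python
import io
import sys


def safe_print(text, stream=None):
    """Print text; if the stream's codec cannot encode it, print an escaped form."""
    stream = stream if stream is not None else sys.stdout
    try:
        print(text, file=stream)
    except UnicodeEncodeError:
        encoding = getattr(stream, "encoding", None) or "ascii"
        # backslashreplace turns unencodable characters into \xNN style escapes.
        escaped = text.encode(encoding, errors="backslashreplace").decode(encoding)
        print(escaped, file=stream)


# Demo: a text stream that encodes with cp949, like the console in the error above.
buffer = io.BytesIO()
console = io.TextIOWrapper(buffer, encoding="cp949", newline="\n")
safe_print("caf\u00e2", stream=console)
console.flush()
demo_output = buffer.getvalue()
```

With this fallback the unencodable character is shown as an escape sequence rather than aborting the whole run.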

Before I started writing code to get the MP3 files from the BBC Learning English programs, I was unaware of the existence of 'feedparser'. With feedparser, reading the RSS feed becomes quite simple. For more information, visit the Python wiki page and PyPI.

I managed to write code for retrieving the MP3 files, but it is so slow that I could take a long break with my colleague after starting it. To remedy this, I decided to use threads.

import gc

import feedparser
import requests
from bs4 import BeautifulSoup

rss = feedparser.parse('http://feeds.bbci.co.uk/learningenglish/english/features/6-minute-english/rss')

for item in rss.entries:
    p = requests.get(item.link)
    soup = BeautifulSoup(p.text, 'html.parser')

    try:
        link = soup.find('a', class_='bbcle-download-extension-mp3')['href']
        name = link.split('/')[-1]

        mp3 = requests.get(link)

        print(len(mp3.content))

        #with open(name, 'wb') as f:
        #   f.write(mp3.content)

        gc.collect()

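    except TypeError:
        # soup.find() returns None when the page has no MP3 link,
        # so indexing it with ['href'] raises TypeError
        print('the page does not include an mp3 file')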

    del(p, soup)

Before adopting threads in this application, profiling is a good starting point for figuring out where the time actually goes. Running the profiler produced the following result.

Python profilers

ncalls  tottime percall cumtime percall filename:lineno(function)
105179  490.659 0.005   490.659 0.005   {method 'recv_into' of  '_socket.socket'    objects}
114     36.274  0.318   36.275  0.318   {method 'connect'   of  '_socket.socket'    objects}
114     1.871   0.016   1.876   0.016   {built-in   method  getaddrinfo}
63978   0.463   0       2.225   0       parser.py:360(parse_starttag)
48440   0.43    0       348.58  0.007   {function   HTTPResponse.read   at  0x020BDC90}
57      0.394   0.007   3.962   0.07    parser.py:193(goahead)
48440   0.34    0       347.912 0.007   {method 'readinto'  of  '_io.BufferedReader'    objects}
451569  0.335   0       0.335   0       {method 'match' of  '_sre.SRE_Pattern'  objects}
913     0.319   0       350.123 0.383   {method 'join'  of  'bytes' objects}
48549   0.317   0       349.529 0.007   response.py:205(read)
52      0.274   0.005   0.274   0.005   {method 'write' of  '_io.BufferedWriter'    objects}
174807  0.248   0       0.252   0       element.py:189(setup)
105179  0.24    0       491.237 0.005   socket.py:357(readinto)
48440   0.227   0       348.151 0.007   client.py:520(readinto)
64035   0.204   0       0.598   0       element.py:779(__init__)
955326  0.203   0       0.429   0       {built-in   method  isinstance}
63978   0.191   0       1.138   0       __init__.py:386(handle_starttag)
56507   0.171   0       0.809   0       parser.py:463(parse_endtag)
49654   0.169   0       0.313   0       _collections_abc.py:422(get)
129331  0.164   0       0.528   0       __init__.py:287(endData)
39384   0.163   0       0.247   0       __init__.py:148(_replace_cdata_list_attribute_values)
105179  0.155   0       0.155   0       {method '_checkClosed'  of  '_io._IOBase'   objects}
181965  0.153   0       0.153   0       {method 'search'    of  '_sre.SRE_Pattern'  objects}
247523  0.149   0       0.206   0       _markupbase.py:48(updatepos)
135456  0.146   0       0.226   0       abc.py:178(__instancecheck__)
48550   0.141   0       348.724 0.007   client.py:489(read)
48662   0.127   0       0.127   0       response.py:1(is_fp_closed)
63246   0.126   0       0.196   0       __init__.py:363(_popToTag)
50079   0.125   0       0.143   0       _collections.py:154(__getitem__)
63978   0.116   0       1.254   0       _htmlparser.py:52(handle_starttag)
48549   0.115   0       0.43    0       response.py:176(_init_decoder)
64035   0.112   0       0.131   0       __init__.py:278(pushTag)
48549   0.112   0       349.768 0.007   response.py:286(stream)
103466  0.11    0       0.777   0       element.py:1627(search)
57038   0.109   0       0.606   0       element.py:1586(search_tag)
105293  0.098   0       0.098   0       socket.py:398(readable)
1       0.093   0.093   537.655 537.655 Deaton.py:25(get_mp3)

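This result shows that receiving web content accounts for most of the run time: about 491 of the 538 seconds are spent in socket reads (recv_into) and another 36 seconds in connect, which points to network latency rather than parsing. So I concluded that using threads for the network transfers is reasonable, and I rewrote the code accordingly.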
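
For reference, a table with these columns can be collected with the standard cProfile and pstats modules. The sketch below is only an illustration: slow_io and work are hypothetical stand-ins for the download code, with time.sleep playing the role of network waits.

```python
import cProfile
import io
import pstats
import time


def slow_io():
    """Stand-in for a blocking network call."""
    time.sleep(0.05)


def work():
    for _ in range(3):
        slow_io()


profiler = cProfile.Profile()
profiler.enable()
work()
profiler.disable()

# Sort by time spent inside each function itself (tottime) and keep the top 10 rows.
report = io.StringIO()
stats = pstats.Stats(profiler, stream=report)
stats.sort_stats("tottime").print_stats(10)
report_text = report.getvalue()
print(report_text)
```

Sorting by tottime puts the functions that do the actual waiting at the top, which is how the socket calls stand out in the table above.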

The threaded version performed better and I was satisfied, but the overall design of the program could still be improved in terms of maintainability or adding a GUI.
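
The threaded code itself is not reproduced here. As a rough sketch of the idea only, and not the original implementation, the standard concurrent.futures module can overlap the waiting time of several downloads. The fetch function below is a hypothetical stand-in that sleeps instead of calling requests.get, and the example.com URLs are placeholders.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch(url):
    """Hypothetical stand-in for requests.get(url).content; sleeps to mimic latency."""
    time.sleep(0.2)
    return url.split("/")[-1]


def fetch_all(urls, max_workers=8):
    """Run fetch() on every URL in a thread pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))


urls = ["http://example.com/audio/episode%d.mp3" % i for i in range(8)]

start = time.perf_counter()
names = fetch_all(urls)
elapsed = time.perf_counter() - start
print(names, round(elapsed, 2))
```

Because the threads spend their time waiting on I/O rather than computing, eight simulated downloads finish in roughly the time of one instead of eight times as long.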