pandasql: Make python speak SQL


http://blog.yhat.com/posts/pandasql-intro.html

Introduction

One of my favorite things about Python is that users get the benefit of observing the R community and then emulating the best parts of it. I’m a big believer that a language is only as helpful as its libraries and tools.

This post is about pandasql, a Python package we (Yhat) wrote that emulates the R package sqldf. It’s a small but mighty library comprised of just 358 lines of code. The idea of pandasql is to make Python speak SQL. For those of you who come from a SQL-first background or still “think in SQL”, pandasql is a nice way to take advantage of the strengths of both languages.

In this introduction, we’ll show you how to get up and running with pandasql inside of Rodeo, the integrated development environment (IDE) we built for data exploration and analysis. Rodeo is an open source and completely free tool. If you’re an R user, it’s a comparable tool with a similar feel to RStudio. As of today, Rodeo can only run Python code, but last week we added syntax highlighting for a bunch of other languages to the editor (Markdown, JSON, Julia, SQL). As you may have read or guessed, we’ve got big plans for Rodeo, including adding SQL support so that you can run your SQL queries right inside of Rodeo, even without our handy little pandasql. More on that in the next week or two!

Downloading Rodeo

Start by downloading Rodeo for Mac, Windows or Linux from the Rodeo page on the Yhat website.

P.S. If you download Rodeo and encounter a problem or simply have a question, we monitor our discourse forum 24/7 (okay, almost).

A bit of background, if you’re curious

Behind the scenes, pandasql uses the pandas.io.sql module to transfer data between DataFrame and SQLite databases. Operations are performed in SQL, the results returned, and the database is then torn down. The library makes heavy use of pandas write_frame and frame_query, two functions which let you read and write to/from pandas and (most) any SQL database.
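To make that round trip concrete, here is a minimal sketch using only Python's built-in sqlite3 module. It is not pandasql's actual source: the helper name run_query, the table layout, and the three sample rows (taken from the births output shown later in this post) are assumptions for illustration.

```python
import sqlite3


def run_query(rows, query):
    """Load rows into a throwaway in-memory SQLite table, run one query, tear down."""
    conn = sqlite3.connect(":memory:")  # the database exists only for this call
    try:
        conn.execute("CREATE TABLE births (date TEXT, births INTEGER)")
        conn.executemany("INSERT INTO births VALUES (?, ?)", rows)
        return conn.execute(query).fetchall()
    finally:
        conn.close()  # closing the connection discards the in-memory database


# Hypothetical sample data standing in for a DataFrame
sample_rows = [
    ("1975-01-01 00:00:00", 265775),
    ("1975-02-01 00:00:00", 241045),
    ("1975-03-01 00:00:00", 268849),
]

result = run_query(
    sample_rows,
    "SELECT strftime('%Y', date), SUM(births) FROM births GROUP BY 1",
)
print(result)
```

pandasql does the same kind of load, query, and teardown for you, with DataFrames going in and a DataFrame coming out.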

Install pandasql

Install pandasql using the package manager pane in Rodeo. Simply search for pandasql and click Install Package.

You can also run ! pip install pandasql from the text editor if you prefer to install that way.

Check out the datasets

pandasql has two built-in datasets which we’ll use for the examples below.

  • meat: Dataset from the U.S. Dept. of Agriculture containing metrics on livestock, dairy, and poultry outlook and production
  • births: Dataset from the United Nations Statistics Division containing demographic statistics on live births by month

Run the following code to check out the data sets.

<code>#Checking out meat and birth data
from pandasql import sqldf
from pandasql import load_meat, load_births

meat = load_meat()
births = load_births()

#You can inspect the dataframes directly if you're using Rodeo
#These print statements are here just in case you want to check out your data in the editor, too
print(meat.head())
print(births.head())
</code>

Inside Rodeo, you really don’t even need the print(variable.head()) statements, since you can actually just examine the DataFrames directly.

An odd graph

<code># Let's make a graph to visualize the data
# Bet you haven't had a title quite like this before
import matplotlib.pyplot as plt
from pandasql import *
import pandas as pd

pysqldf = lambda q: sqldf(q, globals())

q  = """
SELECT
  m.date
  , m.beef
  , b.births
FROM
  meat m
LEFT JOIN
  births b
    ON m.date = b.date
WHERE
    m.date > '1974-12-31';
"""

meat = load_meat()
births = load_births()

df = pysqldf(q)
df['births'] = df['births'].bfill()  # back-fill months that have no births figure

fig = plt.figure()
ax1 = fig.add_subplot(111)
# pd.rolling_mean was removed from pandas; Series.rolling(12).mean() is the equivalent
ax1.plot(df['beef'].rolling(12).mean(), color='b')
ax1.set_xlabel('months since 1975')
ax1.set_ylabel('cattle slaughtered', color='b')

ax2 = ax1.twinx()
ax2.plot(df['births'].rolling(12).mean(), color='r')
ax2.set_ylabel('babies born', color='r')
plt.title("Beef Consumption and the Birth Rate")
plt.show()
</code>

Notice that the plot appears both in the console and the plot tab (bottom right tab).

Tip: You can “pop out” your plot by clicking the arrows at the top of the pane. This is handy if you’re working on multiple monitors and want to dedicate one just to your data visualizations.

Usage

To keep this post concise and easy to read, we’ve just given the code snippets and a few lines of results for most of the queries below.

If you’re following along in Rodeo, a few tips as you’re getting started:

  • Run Script will indeed run everything you have written in the text editor
  • You can highlight a code chunk and run it by clicking Run Line or pressing Command + Enter
  • You can resize the panes (when I’m not making plots I shrink down the bottom right pane)

Basics

Write some SQL and execute it against your pandas DataFrame by substituting DataFrames for tables.

<code>q = """
    SELECT
        *
    FROM
        meat
    LIMIT 10;"""

print(sqldf(q, locals()))

#                   date  beef  veal  pork  lamb_and_mutton broilers other_chicken turkey
# 0  1944-01-01 00:00:00   751    85  1280               89     None          None   None
# 1  1944-02-01 00:00:00   713    77  1169               72     None          None   None
# 2  1944-03-01 00:00:00   741    90  1128               75     None          None   None
# 3  1944-04-01 00:00:00   650    89   978               66     None          None   None
</code>

pandasql creates a DB, schema and all, loads your data, and runs your SQL.

Aggregation

pandasql supports aggregation. You can use aliased column names or column numbers in your GROUP BY clause.

<code># births per year
q = """
    SELECT
        strftime("%Y", date)
        , SUM(births)
    FROM births
    GROUP BY 1
    ORDER BY 1;
            """

print(sqldf(q, locals()))

#    strftime("%Y", date)  SUM(births)
# 0                  1975      3136965
# 1                  1976      6304156
# 2                  1979      3333279
# 3                  1982      3612258
</code>

locals() vs. globals()

pandasql needs to have access to other variables in your session/environment. You can pass locals() to pandasql when executing a SQL statement, but if you’re running a lot of queries that might be a pain. To avoid passing locals all the time, you can add this helper function to your script to set globals() like so:

<code>def pysqldf(q):
    return sqldf(q, globals())

q = """
    SELECT
        *
    FROM
        births
    LIMIT 10;"""

print(pysqldf(q))
# 0  1975-01-01 00:00:00  265775
# 1  1975-02-01 00:00:00  241045
# 2  1975-03-01 00:00:00  268849
</code>

Joins

You can join dataframes using normal SQL syntax.

<code># joining meats + births on date
q = """
    SELECT
        m.date
        , b.births
        , m.beef
    FROM
        meat m
    INNER JOIN
        births b
            on m.date = b.date
    ORDER BY
        m.date
    LIMIT 100;
    """

joined = pysqldf(q)
print(joined.head())
#date  births    beef
#0  1975-01-01 00:00:00.000000  265775  2106.0
#1  1975-02-01 00:00:00.000000  241045  1845.0
#2  1975-03-01 00:00:00.000000  268849  1891.0
</code>

WHERE conditions

Here’s a WHERE clause.

<code>q = """
    SELECT
        date
        , beef
        , veal
        , pork
        , lamb_and_mutton
    FROM
        meat
    WHERE
        lamb_and_mutton >= veal
    ORDER BY date DESC
    LIMIT 10;
    """

print(pysqldf(q))
#                   date    beef  veal    pork  lamb_and_mutton
# 0  2012-11-01 00:00:00  2206.6  10.1  2078.7             12.4
# 1  2012-10-01 00:00:00  2343.7  10.3  2210.4             14.2
# 2  2012-09-01 00:00:00  2016.0   8.8  1911.0             12.5
# 3  2012-08-01 00:00:00  2367.5  10.1  1997.9             14.2
</code>

It’s just SQL

Since pandasql is powered by SQLite3, you can do most anything you can do in SQL. Here are some examples using common SQL features such as subqueries, order by, functions, and unions.

<code>#################################################
# SQL FUNCTIONS
# e.g. `RANDOM()`
#################################################
q = """SELECT
    *
    FROM
        meat
    ORDER BY RANDOM()
    LIMIT 10;"""
print(pysqldf(q))
#                   date  beef  veal  pork  lamb_and_mutton  broilers other_chicken  turkey
# 0  1967-03-01 00:00:00  1693    65  1136               61     472.0          None    26.5
# 1  1944-12-01 00:00:00   764   146  1013               91       NaN          None     NaN
# 2  1969-06-01 00:00:00  1666    50   964               42     573.9          None    85.4
# 3  1983-03-01 00:00:00  1892    37  1303               36    1106.2          None   182.7

#################################################
# UNION ALL
#################################################
q = """
        SELECT
            date
            , 'beef' AS meat_type
            , beef AS value
        FROM meat
        UNION ALL
        SELECT
            date
            , 'veal' AS meat_type
            , veal AS value
        FROM meat

        UNION ALL

        SELECT
            date
            , 'pork' AS meat_type
            , pork AS value
        FROM meat
        UNION ALL
        SELECT
            date
            , 'lamb_and_mutton' AS meat_type
            , lamb_and_mutton AS value
        FROM meat
        ORDER BY 1
    """
print(pysqldf(q).head(20))
#                    date        meat_type  value
# 0   1944-01-01 00:00:00             beef    751
# 1   1944-01-01 00:00:00             veal     85
# 2   1944-01-01 00:00:00             pork   1280
# 3   1944-01-01 00:00:00  lamb_and_mutton     89


#################################################
# subqueries
# fancy!
#################################################
q = """
    SELECT
        m1.date
        , m1.beef
    FROM
        meat m1
    WHERE m1.date IN
        (SELECT
            date
        FROM meat
        WHERE
            beef >= broilers
        ORDER BY date)
"""

more_beef_than_broilers = pysqldf(q)
print(more_beef_than_broilers.head(10))
#                   date  beef
# 0  1960-01-01 00:00:00  1196
# 1  1960-02-01 00:00:00  1089
# 2  1960-03-01 00:00:00  1201
# 3  1960-04-01 00:00:00  1066
</code>

Final thoughts

pandas is an incredible tool for data analysis in large part, we think, because it is extremely digestible, succinct, and expressive. Ultimately, there are tons of reasons to learn the nuances of merge, join, concatenate, melt and other native pandas features for slicing and dicing data. Check out the docs for some examples.

Our hope is that pandasql will be a helpful learning tool for folks new to Python and pandas. In my own personal experience learning R, sqldf was a familiar interface helping me become highly productive with a new tool as quickly as possible.

File operations in Python


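Working with files and folders in Python (the file operation functions) relies on the os module and the shutil module.

Get the current working directory, i.e. the directory the current Python script is working in: os.getcwd()

Return the names of all files and directories in a given directory: os.listdir()

Delete a file: os.remove()

Delete an empty directory and then any parent directories left empty: os.removedirs(r"c:\python")

Check whether a given path is a file: os.path.isfile()

Check whether a given path is a directory: os.path.isdir()

Check whether a path is absolute: os.path.isabs()

Check whether a given path actually exists: os.path.exists()

Split a path into its directory name and file name: os.path.split()     e.g. os.path.split('/home/swaroop/byte/code/poem.txt') returns ('/home/swaroop/byte/code', 'poem.txt')

Split off the extension: os.path.splitext()

Get the directory part of a path: os.path.dirname()

Get the file name: os.path.basename()

Run a shell command: os.system()

Read and set environment variables: os.getenv() and os.putenv()

The line terminator used by the current platform: os.linesep    Windows uses '\r\n', Linux and modern macOS use '\n', and classic Mac OS used '\r'

The platform you are running on: os.name       On Windows it is 'nt'; on Linux/Unix it is 'posix'

Rename: os.rename(old, new)

Create nested directories: os.makedirs(r"c:\python\test")

Create a single directory: os.mkdir("test")

Get file attributes: os.stat(file)

Change file permissions: os.chmod(file, mode)    (timestamps are changed with os.utime(file, times))

Terminate the current process: sys.exit()    (there is no os.exit(); os._exit() exits immediately without cleanup)

Get the file size: os.path.getsize(filename)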
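
A few of the path helpers above can be tried together in a short script. This is a minimal sketch: the sample path is the one from the os.path.split() entry, while the temporary directory and the file name poem.txt inside it are assumptions made so the script can run anywhere.

```python
import os
import tempfile

path = "/home/swaroop/byte/code/poem.txt"  # sample path from the list above

# Pure string operations: nothing has to exist on disk for these
directory, filename = os.path.split(path)
stem, ext = os.path.splitext(filename)
same_directory = os.path.dirname(path)
same_filename = os.path.basename(path)

# Checks that touch the file system, run inside a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "poem.txt")
    with open(target, "w") as fp:
        fp.write("hello")
    is_file = os.path.isfile(target)
    is_dir = os.path.isdir(tmp)
    size = os.path.getsize(target)
    listing = os.listdir(tmp)

# The temporary directory is gone once the with block ends
exists_after = os.path.exists(target)

print(directory, filename, stem, ext, size, listing, exists_after)
```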
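File operations:
os.mknod("test.txt")        Create an empty file (Unix only)
fp = open("test.txt", "w")     Open a file directly; the file is created if it does not exist

About open modes:

r     open for reading (the default)
w     open for writing (truncates an existing file; creates the file if needed)
a     open for appending (starts at EOF; creates the file if needed)
r+     open for reading and writing
w+     open for reading and writing (see w)
a+     open for reading and writing (see a)
rb     open for reading in binary mode
wb     open for writing in binary mode (see w)
ab     open for appending in binary mode (see a)
rb+    open for reading and writing in binary mode (see r+)
wb+    open for reading and writing in binary mode (see w+)
ab+    open for reading and writing in binary mode (see a+)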
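
The differences between the modes are easiest to see by running them against one file. A minimal sketch, assuming a scratch file named test.txt in a temporary directory (both are illustrative choices):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.txt")

    with open(path, "w") as fp:   # creates the file, or empties an existing one
        fp.write("first\n")

    with open(path, "a") as fp:   # keeps existing content and writes at EOF
        fp.write("second\n")

    with open(path, "r+") as fp:  # read/write without truncating; file must exist
        before = fp.read()

    with open(path, "w+") as fp:  # read/write, but the file is emptied first
        fp.write("new")
        fp.seek(0)                # go back to the start to read what was written
        after = fp.read()

    with open(path, "rb") as fp:  # binary mode returns bytes instead of str
        raw = fp.read()

print(repr(before), repr(after), repr(raw))
```

Using with closes each file automatically, even if an exception is raised inside the block.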

 

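fp.read([size])                     # size is the amount to read, in bytes

fp.readline([size])                 # read one line; if size is given, only part of the line may be returned

fp.readlines([size])                # return a list in which each line of the file is one element. Internally it works by calling readline() in a loop. If size is given, it is the total amount to read, so only part of the file may be read.

fp.write(str)                      # write str to the file; write() does not append a newline after str

fp.writelines(seq)            # write everything in seq to the file (several lines in one call). It writes exactly what it is given and adds nothing after each line.

fp.close()                        # close the file. Python closes a file automatically once it is no longer used, but this is not guaranteed, so it is best to get into the habit of closing it yourself. Operating on a file after it has been closed raises ValueError

fp.flush()                                      # write the contents of the buffer to disk

fp.fileno()                                      # return the integer file descriptor

fp.isatty()                                      # whether the file is a terminal device (on Unix systems)

fp.tell()                                         # return the current file position, measured from the start of the file

fp.next()                                       # return the next line and advance the file position to the following line. When a file is used in a for ... in file statement, iteration works by calling next(). This is the Python 2 spelling; in Python 3 use next(fp).

fp.seek(offset[,whence])              # move the file position to offset. The offset is normally counted from the start of the file and is usually positive. If whence is given this changes: 0 counts from the start, 1 counts from the current position, and 2 counts from the end of the file. Note that if the file was opened in a or a+ mode, the file position automatically returns to the end of the file on every write.

fp.truncate([size])                       # cut the file to the given size; the default is to cut at the current file position. If size is larger than the file, then depending on the system the file may be left unchanged, padded to that size with zeros, or extended with arbitrary content.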
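
The interplay of tell(), seek() and truncate() is easier to follow on a small file with known contents. A minimal sketch, assuming a scratch file named demo.bin; binary mode is used so that seeking relative to the end of the file is allowed:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.bin")

    with open(path, "wb+") as fp:
        fp.write(b"0123456789")
        end_pos = fp.tell()        # position after writing ten bytes

        fp.seek(2)                 # whence defaults to 0: count from the start
        middle = fp.read(3)

        fp.seek(-4, os.SEEK_END)   # whence=2: count back from the end
        tail = fp.read()

        fp.seek(5)
        fp.truncate()              # no size given: cut at the current position
        fp.seek(0)
        remaining = fp.read()

    with open(path, "ab") as fp:
        fp.seek(0)                 # ignored for writing in append mode
        fp.write(b"Z")             # lands at the end of the file anyway

    with open(path, "rb") as fp:
        final = fp.read()

print(end_pos, middle, tail, remaining, final)
```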

 

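Directory operations:
os.mkdir("file")                   Create a directory
Copy a file:
shutil.copyfile("oldfile", "newfile")       oldfile and newfile must both be files
shutil.copy("oldfile", "newfile")            oldfile must be a file; newfile can be a file or a target directory
Copy a folder:
shutil.copytree("olddir", "newdir")        olddir and newdir must both be directories, and newdir must not already exist
Rename a file (or directory):
os.rename("oldname", "newname")       The same call works for files and directories
Move a file (or directory):
shutil.move("oldpos", "newpos")
Delete a file:
os.remove("file")
Delete a directory:
os.rmdir("dir")    Only removes empty directories
shutil.rmtree("dir")    Removes directories whether they are empty or not
Change directory:
os.chdir("path")   Switch to another path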
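
The calls above can be exercised end to end inside a throwaway directory. A minimal sketch; every file and directory name in it is a made-up placeholder:

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    old_dir = os.path.join(tmp, "olddir")
    os.mkdir(old_dir)
    old_file = os.path.join(old_dir, "oldfile.txt")
    with open(old_file, "w") as fp:
        fp.write("data")

    # copyfile: file to file
    shutil.copyfile(old_file, os.path.join(old_dir, "newfile.txt"))

    # copy: the destination may be a directory; the file name is kept
    backup_dir = os.path.join(tmp, "backup")
    os.mkdir(backup_dir)
    shutil.copy(old_file, backup_dir)

    # copytree: the destination directory must not exist yet
    new_dir = os.path.join(tmp, "newdir")
    shutil.copytree(old_dir, new_dir)
    copied = sorted(os.listdir(new_dir))

    # rename, move, then delete a single file
    renamed = os.path.join(old_dir, "renamed.txt")
    os.rename(old_file, renamed)
    moved = os.path.join(tmp, "moved.txt")
    shutil.move(renamed, moved)
    os.remove(moved)

    # rmdir refuses a non-empty directory; rmtree removes it with its contents
    try:
        os.rmdir(new_dir)
        rmdir_failed = False
    except OSError:
        rmdir_failed = True
    shutil.rmtree(new_dir)

    remaining = sorted(os.listdir(tmp))

print(copied, rmdir_failed, remaining)
```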

 

Example

 1 Append '_fc' to the name of every image in a folder

Python code:

# -*- coding:utf-8 -*-
import re
import os
import time

# str.split(sep) splits a string
# 'sep'.join(list) joins a list into a string
def change_name(path):
    global i
    if not os.path.isdir(path) and not os.path.isfile(path):
        return False
    if os.path.isfile(path):
        file_path = os.path.split(path)  # split into directory and file name
        lists = file_path[1].split('.')  # split the file name from its extension
        file_ext = lists[-1]  # take the extension (last element of the list)
        img_ext = ['bmp', 'jpeg', 'gif', 'psd', 'png', 'jpg']
        if file_ext in img_ext:
            os.rename(path, file_path[0] + '/' + lists[0] + '_fc.' + file_ext)
            i += 1  # careful: this i is a global, which is a trap
        # or
        # img_ext = 'bmp|jpeg|gif|psd|png|jpg'
        # if file_ext in img_ext:
        #     print('ok---' + file_ext)
    elif os.path.isdir(path):
        for x in os.listdir(path):
            change_name(os.path.join(path, x))  # os.path.join() is very handy for building paths

img_dir = 'D:\\xx\\xx\\images'
img_dir = img_dir.replace('\\', '/')
start = time.time()
i = 0
change_name(img_dir)
c = time.time() - start
print('Elapsed time: %0.2f' % (c))
print('Processed %s images in total' % (i))

Output:

Elapsed time: 0.11
Processed 109 images in total
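
One way to sidestep the global counter is to let the function return the count. The sketch below is an alternative, not the original author's code: it swaps the manual recursion for os.walk() and uses os.path.splitext() so that names containing extra dots keep their full stem. The function name add_suffix_to_images and the sample files are assumptions for illustration.

```python
import os
import tempfile

IMAGE_EXTENSIONS = {".bmp", ".jpeg", ".gif", ".psd", ".png", ".jpg"}


def add_suffix_to_images(root, suffix="_fc"):
    """Append suffix to the stem of every image under root; return how many were renamed."""
    renamed = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            stem, ext = os.path.splitext(name)  # splits at the last dot only
            if ext.lower() in IMAGE_EXTENSIONS:
                os.rename(
                    os.path.join(dirpath, name),
                    os.path.join(dirpath, stem + suffix + ext),
                )
                renamed += 1
    return renamed


# Try it on a small throwaway tree instead of a real image folder
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "sub"))
    for rel in ("a.jpg", "notes.txt", os.path.join("sub", "b.v2.png")):
        open(os.path.join(tmp, rel), "w").close()

    count = add_suffix_to_images(tmp)
    result = sorted(
        os.path.relpath(os.path.join(dirpath, name), tmp)
        for dirpath, _dirnames, filenames in os.walk(tmp)
        for name in filenames
    )

print(count, result)
```

Like the original, running it a second time appends the suffix again, so it is meant as a one-shot script.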

11 Python Libraries You Might Not Know

Original post: http://blog.yhathq.com/posts/11-python-libraries-you-might-not-know.html

There are tons of Python packages out there. So many that no one man or woman could possibly catch them all. PyPI alone has over 47,000 packages listed!

Recently, with so many data scientists making the switch to Python, I couldn’t help but think that while they’re getting some of the great benefits of pandas, scikit-learn, and numpy, they’re missing out on some older yet equally helpful Python libraries.

In this post, I’m going to highlight some lesser-known libraries. Even you experienced Pythonistas should take a look, there might be one or two in there you’ve never seen!

1) delorean

Delorean is a really cool date/time library. Apart from having a sweet name, it’s one of the more natural feeling date/time munging libraries I’ve used in Python. It’s sort of like moment in javascript, except I laugh every time I import it. The docs are also good and in addition to being technically helpful, they also make countless Back to the Future references.

<code>from delorean import Delorean
EST = "US/Eastern"
d = Delorean(timezone=EST)</code>

2) prettytable

There’s a chance you haven’t heard of prettytable because it’s listed on GoogleCode, which is basically the coding equivalent of Siberia.

Despite being exiled to a cold, snowy and desolate place, prettytable is great for constructing output that looks good in the terminal or in the browser. So if you’re working on a new plug-in for the IPython Notebook, check out prettytable for your HTML __repr__.

<code>from prettytable import PrettyTable
table = PrettyTable(["animal", "ferocity"])
table.add_row(["wolverine", 100])
table.add_row(["grizzly", 87])
table.add_row(["Rabbit of Caerbannog", 110])
table.add_row(["cat", -1])
table.add_row(["platypus", 23])
table.add_row(["dolphin", 63])
table.add_row(["albatross", 44])
table.sortby = "ferocity"  # column to sort on
table.reversesort = True   # highest ferocity first
print(table)
+----------------------+----------+
|        animal        | ferocity |
+----------------------+----------+
| Rabbit of Caerbannog |   110    |
|      wolverine       |   100    |
|       grizzly        |    87    |
|       dolphin        |    63    |
|      albatross       |    44    |
|       platypus       |    23    |
|         cat          |    -1    |
+----------------------+----------+</code>

3) snowballstemmer

Ok so the first time I installed snowballstemmer, it was because I thought the name was cool. But it’s actually a pretty slick little library. snowballstemmer will stem words in 15 different languages and also comes with a porter stemmer to boot.

<code>from snowballstemmer import EnglishStemmer, SpanishStemmer
EnglishStemmer().stemWord("Gregory")
# Gregori
SpanishStemmer().stemWord("amarillo")
# amarill</code>

4) wget

Remember every time you wrote that web crawler for some specific purpose? Turns out somebody built it…and it’s called wget. Recursively download a website? Grab every image from a page? Sidestep cookie traces? Done, done, and done.

Movie Mark Zuckerberg even says it himself

First up is Kirkland, they keep everything open and allow indexes on their apache configuration, so a little wget magic is enough to download the entire Kirkland facebook. Kid stuff!

The Python version comes with just about every feature you could ask for and is easy to use.

<code>import wget
wget.download("http://www.cnn.com/")
# 100% [............................................................................] 280385 / 280385</code>

Note that another option for Linux and OS X users would be to use: from sh import wget. However, the Python wget module does have better argument handling.

5) PyMC

I’m not sure how PyMC gets left out of the mix so often. scikit-learn seems to be everyone’s darling (as it should, it’s fantastic), but in my opinion, not enough love is given to PyMC.

<code>from pymc.examples import disaster_model
from pymc import MCMC
M = MCMC(disaster_model)
M.sample(iter=10000, burn=1000, thin=10)
[-----------------100%-----------------] 10000 of 10000 complete in 1.4 sec</code>

If you don’t already know it, PyMC is a library for doing Bayesian analysis. It’s featured heavily in Cam Davidson-Pilon’s Bayesian Methods for Hackers and has made cameos on a lot of popular data science/python blogs, but has never received the cult following akin to scikit-learn.

6) sh

I can’t risk you leaving this page and not knowing about sh. sh lets you import shell commands into Python as functions. It’s super useful for doing things that are easy in bash but you can’t remember how to do in Python (e.g. recursively searching for files).

<code>from sh import find
find("/tmp")
/tmp/foo
/tmp/foo/file1.json
/tmp/foo/file2.json
/tmp/foo/file3.json
/tmp/foo/bar/file3.json</code>
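
For comparison, here is roughly what the recursive search looks like with nothing but the standard library. This is a minimal sketch built on os.walk(); the function name find and the sample tree are assumptions, and the ordering of results will not match the shell command exactly.

```python
import os
import tempfile


def find(root):
    """Return root and every path beneath it, roughly like running `find root`."""
    paths = [root]
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # sorting in place also fixes the order os.walk descends in
        for name in dirnames + sorted(filenames):
            paths.append(os.path.join(dirpath, name))
    return paths


# Build a tiny tree to search, then list it relative to its root
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "foo", "bar"))
    open(os.path.join(tmp, "foo", "file1.json"), "w").close()
    open(os.path.join(tmp, "foo", "bar", "file3.json"), "w").close()
    found = [os.path.relpath(p, tmp) for p in find(tmp)]

print(found)
```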

7) fuzzywuzzy

Ranking in the top 10 of simplest libraries I’ve ever used (if you have 2-3 minutes, you can read through the source), fuzzywuzzy is a fuzzy string matching library built by the fine people at SeatGeek.

fuzzywuzzy implements things like string comparison ratios, token ratios, and plenty of other matching metrics. It’s great for creating feature vectors or matching up records in different databases.

<code>from fuzzywuzzy import fuzz
fuzz.ratio("Hit me with your best shot", "Hit me with your pet shark")
# 85</code>
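
To get a feel for what a string comparison ratio measures, the standard library's difflib offers something similar in spirit. The sketch below is not fuzzywuzzy's implementation, and its scores may differ from fuzz.ratio; the helper name simple_ratio is an assumption.

```python
from difflib import SequenceMatcher


def simple_ratio(a, b):
    """Similarity of two strings as an integer from 0 (nothing shared) to 100 (identical)."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


score = simple_ratio("Hit me with your best shot", "Hit me with your pet shark")
print(score)
print(simple_ratio("abc", "abc"))  # identical strings score 100
print(simple_ratio("abc", "xyz"))  # strings with nothing in common score 0
```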

8) progressbar

You know those scripts you have where you do a print "still going..." in that giant mess of a for loop you call your __main__? Yeah well instead of doing that, why don’t you step up your game and start using progressbar?

progressbar does pretty much exactly what you think it does…makes progress bars. And while this isn’t exactly a data science specific activity, it does put a nice touch on those extra long running scripts.

Alas, as another GoogleCode outcast, it’s not getting much love (the docs have 2 spaces for indents…2!!!). Do what’s right and give it a good ole pip install.

<code>from progressbar import ProgressBar
import time
pbar = ProgressBar(maxval=10).start()  # start() must be called before update()
for i in range(1, 11):
    pbar.update(i)
    time.sleep(1)
pbar.finish()
# 60% |########################################################                                      |</code>

9) colorama

So while you’re making your logs have nice progress bars, why not also make them colorful! It can actually be helpful for reminding yourself when things are going horribly wrong.

colorama is super easy to use. Just pop it into your scripts and add any text you want to print to a color:
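
As a dependency-free stand-in, the sketch below wraps text in raw ANSI escape sequences, which is the mechanism colorama builds on (colorama exposes these codes as constants and also makes them work on Windows terminals). The helper name colorize and the small color table are assumptions for illustration, not colorama's API.

```python
# ANSI escape sequences for a few foreground colors (hypothetical lookup table)
ANSI_COLORS = {
    "red": "\033[31m",
    "green": "\033[32m",
    "yellow": "\033[33m",
}
RESET = "\033[0m"  # switches the terminal back to its default style


def colorize(text, color):
    """Wrap text so that an ANSI-capable terminal prints it in the given color."""
    return ANSI_COLORS[color] + text + RESET


print(colorize("everything is fine", "green"))
print(colorize("things are going horribly wrong", "red"))
```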

10) uuid

I’m of the mind that there are really only a few tools one needs in programming: hashing, key/value stores, and universally unique ids. uuid is the built in Python UUID library. It implements versions 1, 3, 4, and 5 of the UUID standards and is really handy for doing things like…err…ensuring uniqueness.

That might sound silly, but how many times have you had records for a marketing campaign, or an e-mail drop and you want to make sure everyone gets their own promo code or id number?

And if you’re worried about running out of ids, then fear not! The number of UUIDs you can generate is comparable to the number of atoms in the universe.

<code>import uuid
print(uuid.uuid4())
# e7bafa3d-274e-4b0a-b9cc-d898957b4b61</code>

Well if you were a uuid you probably would be.
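
For the promo code scenario, a name-based UUID (version 5) can be handy because the same input always yields the same id. A minimal sketch, assuming a made-up campaign domain and e-mail addresses; the helper name promo_code is hypothetical.

```python
import uuid

# Hypothetical namespace for one campaign, derived from a made-up domain name
CAMPAIGN_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "promo.example.com")


def promo_code(email):
    """Return a repeatable, per-recipient code derived from the e-mail address."""
    return str(uuid.uuid5(CAMPAIGN_NAMESPACE, email.strip().lower()))


random_id = uuid.uuid4()  # version 4: random, different on every call
code_a = promo_code("alice@example.com")
code_b = promo_code("bob@example.com")

print(random_id)
print(code_a)
print(code_b)
```

Re-running the script gives the same code_a and code_b every time, while random_id changes on each run.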

11) bashplotlib

Shameless self-promotion here, bashplotlib is one of my creations. It lets you plot histograms and scatterplots using stdin. So while you might not find it replacing ggplot or matplotlib as your everyday plotting library, the novelty value is quite high. At the very least, use it as a way to spruce up your logs a bit.

<code>$ pip install bashplotlib
$ scatter --file data/texas.txt --pch x</code>