NumPy Tutorial: Learn the Basics in Minutes!

April 27, 2020

NumPy is an awesome tool that you should be familiar with if you’re a Python developer, data scientist, or just somebody interested in doing things with data.

But maybe you haven’t tried to use it or have been intimidated by its apparent learning curve and differences from standard Python data structures.

In this post, we’ll take a look at NumPy, what you can use it for, when you definitely should use it, when not to use it, and why it’s so popular.

NumPy Advantage #1: Speed

When it comes to doing heavy data processing and calculations with Python, you’ll often hear about how Python is natively pretty slow.

However, despite its relative slowness, Python has become the go-to language for data science. Applications that process millions or billions of data points use Python.

That’s odd. Why would data scientists and engineers choose a language that’s even a fraction slower? Over the course of billions of calculations, surely that adds up?

Enter Python’s C API. While Python may be one of the slower languages, the most common distribution of Python is built atop one of the fastest languages, C. When we need blazing fast calculations, Python has an interface for running C code. These C extensions make it easy to offer C-level performance inside a Python application.

NumPy is one such C extension. While CPython itself is written in C, it has to be good for general purpose use and a lot of various data types and edge cases. By contrast, NumPy implements its own custom data types in C that are optimized for fast calculations and modifications. NumPy can make calculations happen much faster, sometimes even 10x faster.

Here’s an example (thanks to tom10 for this code snippet, comments are mine):

from numpy import arange from timeit import Timer

Nelements = 10000 Ntimeits = 10000

x = arange(Nelements) # <-- A NumPy range y = range(Nelements) # <-- A standard Python range

# The numpy .sum() method is optimized for calculations t_numpy = Timer("x.sum()", "from __main__ import x")

# The CPython sum() function is more generalized t_list = Timer("sum(y)", "from __main__ import y")

print("numpy: %.3e" % (t_numpy.timeit(Ntimeits)/Ntimeits,)) print("list: %.3e" % (t_list.timeit(Ntimeits)/Ntimeits,))

On my computer, that output:

python time-sum.py numpy: 2.650e-05 list: 5.758e-04

In this case, NumPy is 20x faster at summing values in a range. That’s significant, and it starts to give you an insight into why NumPy is so valuable and widely used.

NumPy Advantage #2: Memory

NumPy is more efficient with memory and faster at reading/writing. This is because NumPy places restrictions on how you can build your arrays.

In standard Python, a list can include any values:

# This is a valid list my_list = ['a', 2, len, {'z': 1}]

In that list, we have a string, integer, function, and dictionary. But Python doesn’t actually store those 4 things sequentially in memory. Instead, a Python list is just a series of pointers to the objects themselves, wherever they exist.

So, the string 'a' lives somewhere else. The integer 2 is a singleton that Python creates at startup (as are all integers between -1 and 255). The function len is defined in the standard library. And the dictionary is actually a separate complex object, also composed of pointers to its sub-objects.

All those pointers add to the complexity and size of Python lists. Every value in a Python list is at least 20 bytes. 4 bytes per pointer, plus 16 bytes for even the smallest Python object (4 for type pointer, 4 for reference count, 4 for value — and the memory allocators rounds up to 16). [Thanks to Alex Martelli for that math.]

We can test out that list objects are passed by reference for ourselves:

>>> my_list = ['a', 2, len, {'z': 1}] >>> my_list[2](my_list) 4 >>> len(my_list) # Equivalent to the line above 4

By contrast, NumPy arrays are smart enough to figure out if you’ve passed in a list of uniform values. If you did, NumPy will optimize storage and operations on those values.

>>> a = np.array([2,3,4]) >>> a array([2, 3, 4]) >>> a.dtype dtype('int64') # Stored as a 64-bit integer, instead of Python obj

# We can still create a np array from my_list >>> b = np.array(my_list) # But everything is stored in the standard Python way as objects >>> b array(['a', 2, , {'z': 1}], dtype=object)

When you give NumPy standardized inputs, the memory optimizations can be substantial. For instance, Python would take 12GB of memory to handle a billion floats. NumPy on the other hand, could do so with about 4GB. [Hat tip again to Alex Martelli]

NumPy Advantage #3: Convenience

If you’re doing stuff with large datasets, there are some common operations you’ll need to undertake.

Vectors, matrix manipulation, multidimensional arrays, and regressions are all common.

In standard Python, these can be non-trivial to implement. In NumPy they come out-of-the-box, with optimized and tested implementations.

Moreover, you can read directly from files into a NumPy array and do other I/O tasks more efficiently for large datasets.

Additionally, most other data packages, including statistical and machine learning packages support NumPy data structures, making NumPy a good foundation for any data intensive project, even if memory and speed aren’t concerns.

Try It Out!

Hopefully, you’ve now seen how useful NumPy is and made a decision about whether it makes sense for your application.

In any case, it’s a powerful technology that’s worth learning how to use if you’re someone who comes in contact with data often. An understanding of NumPy now could pay off later when you encounter a task that could use it.

1. Installation

NumPy works best when used as part of an established scientific computing Python distribution. That way you can get the most out of it.

By far, the most popular scientific Python distribution is Anaconda. There’s also a smaller version called Miniconda that will give you access to most of what you need.

Create a new directory and install Anaconda/Miniconda in a virtual environment:

# Install the distribution pyenv install miniconda3-latest

# Create a virtual environment using that distribution pyenv virtualenv miniconda3-latest

# Set that virtual environment as the default for this directory pyenv local

Now that we have it installed, we need to activate Anaconda. Specifically, Anaconda uses its own package manager (like pip that you might be familiar with). This package manager is called conda .

You can use the conda package manager with many types of shells:

  • bash
  • fish
  • tcsh
  • xonsh
  • zsh
  • powershell

To activate it:

conda init eval $SHELL # Restart your shell so changes take effect conda activate

Now, we can install NumPy.

conda install numpy

Check it out!

$ python Python 3.8.0 (default, Nov 1 2019, 18:25:29) [GCC 7.3.0] :: Anaconda, Inc. on linux Type "help", "copyright", "credits" or "license" for more information. >>> import numpy >>> numpy.__version__ '1.11.3'

2. Try Some Tutorials

This post is already long, so I won’t make it into a beginner NumPy tutorial. Plus, there are a lot of great resources already out there. Here are my top picks, assuming you already have some Python experience:

  1. The Official Quickstart Tutorial for the basics
  2. SciPy Lecture Notes for an introduction to the scientific computing ecosystem
  3. From Python to NumPy for learning how to practically convert Python functions into much faster NumPy equivalents

3. Build something with NumPy

Test and expand your new knowledge with some exercises:

Then try some simple projects:

How & Why to Use NumPy

NumPy isn’t a perfect solution for every use case. Most notably, using NumPy can make your code more difficult to read, so be sure to comment heavily.

Nor are NumPy’s data structures a complete replacement for Python’s lists. Instead, you should treat NumPy as an additional tool that makes working with uniform datasets much easier.

As a general rule of thumb, whenever you have lots of data in the same format, you’re probably good to use NumPy, especially if there will be millions or billions of data points.

Chances are, if you have a data intensive application, there’s a way NumPy can help you optimize operations.

About Bennett

I’m a web developer building things with Python and JavaScript.

Want my best content on web development and becoming a better programmer?

I share my favorite advice with my email list — no spam, nothing sales-y, just useful content.

Join 500 other developers currently in my email series.


Profile picture

I write something new every day for 2k software developers. You should sign up for the daily email.

© 2024