{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "- Title: Files and How Computers Represent Data \n", "- Date: 2018-11-30 \n", "- Tags: python, programming, week2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this lesson, we're going to learn how to open files and work with data from the disk. We'll start with the mechanical process of opening text files, and then move on to learn a little bit more about different kinds of data you'll see.\n", "\n", "Here's the basic method of opening and reading text files. Suppose I have a file called hello.txt in my working directory. (Your working directory is the directory you run Python from on your hard drive. For those of you using Azure Notebooks, this should be your library, but talk to me if you see a file there and can't read it from Python.)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "hello, I am a text file\n" ] } ], "source": [ "with open(\"hello.txt\", 'r') as my_awesome_file:\n", " hello = my_awesome_file.read()\n", "print(hello)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's break down that code. The first line, starting with `with`, uses what's known as a *context manager*. Any time you see `with` in Python, what you should think is \"this is going to change what's going on in my computer for the duration of this block.\" \n", "\n", "In this case, you can pretty much read exactly what's going to happen like it's English rather than Python. For the duration of the indented block below that first line, all the code in there is going to be executed `with` the file `hello.txt` being `open` and hence available for reading. Inside that block, the name `my_awesome_file` is going to be assigned to the open file (that's what the `as` keyword does). 
The second parameter to the `open` function, the `'r'`, just indicates that you're going to open it for reading---instead of, for example, opening it for writing, in which case the `r` would change to a `w`. You can also open a file to append with `a`---to add more data to the bottom of the file.\n", "\n", "The thing that's assigned to `my_awesome_file` is called a *file handle*. It's just a normal Python object that gives you access to the data inside. In this case, that object has a method, `read()`, that gives you the contents of the file as a string, which we then printed.\n", "\n", "Let's look at writing and appending." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "I contain something different now!!\n" ] } ], "source": [ "with open(\"hello.txt\", \"w\") as still_my_file:\n", " still_my_file.write(\"I contain something different now!!\")\n", "\n", "with open(\"hello.txt\", 'r') as my_awesome_file:\n", " hello = my_awesome_file.read()\n", "print(hello)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You see that when we open a file for writing, we overwrite what was already there. What if you don't want to do that? " ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "I contain something different now!!Hi again!Hi human user!\n" ] } ], "source": [ "with open(\"hello.txt\", \"a\") as my_file:\n", " my_file.write(\"Hi again!\")\n", " my_file.write(\"Hi human user!\")\n", "\n", "with open(\"hello.txt\", 'r') as my_awesome_file:\n", " hello = my_awesome_file.read()\n", "print(hello)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now suppose we want to put more lines in the file, and then read the file line by line into a list. 
We could do that, in the first case, by adding the special character `'\\n'`, and in the second case, by using the `readlines()` method of the file object." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "I contain something different now!!Hi again!Hi human user!\n", "Here's a newline. But:\n", "Newlines don't have to be at the start of a string.\n" ] } ], "source": [ "with open(\"hello.txt\", \"a\") as my_file:\n", " my_file.write(\"\\nHere's a newline. \")\n", " my_file.write(\"But:\\nNewlines don't have to be at the start of a string.\")\n", "\n", "with open(\"hello.txt\", 'r') as my_awesome_file:\n", " hello = my_awesome_file.read()\n", "print(hello)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['I contain something different now!!Hi again!Hi human user!\\n', \"Here's a newline. But:\\n\", \"Newlines don't have to be at the start of a string.\"]\n" ] } ], "source": [ "with open(\"hello.txt\", 'r') as my_awesome_file:\n", " hellolist = my_awesome_file.readlines()\n", "print(hellolist)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As far as Python is concerned, there are two kinds of files that you might want to open: text files and binary files. If you want to open a binary file, you pass `'rb'` or `'wb'` to `open()` as the second parameter, depending on whether you want to read or write the binary file. \n", "\n", "Pretty much all the files you'll be working with in this course will be text files: they'll be txt files, csv files (spreadsheet-like data stored in a plain text format), json files (key-value and list-like data stored in a plain text format), or the like. 
Here are some common binary file formats: \n", "\n", "- zip files (compressed file archives) \n", "- Microsoft Word docx files \n", "- PDF files \n", "- images of all kinds (with the exception of SVG files, which are fancy vector images that are stored as text)\n", "\n", "In almost every case, it'll make more sense to use a function from a library to open a binary file, rather than manipulate it directly. There are libraries to handle PDFs, Word files, and the like. So I won't belabor it here, but the difference between reading as text and reading as binary is one you need in your head for a moment (I'm about to break it)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For reading CSV files, the best choice is to use the *Pandas* library, which should be installed for you already, rather than Python's built-in `csv` module. The latter is a bit obscurely organized. " ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "mydata = pd.read_csv(\"rol-scores.csv\")" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " | State | \n", "Pop. In Millions for 2012 | \n", "RoLScore | \n", "elec_pros | \n", "pol_plur | \n", "free_expr | \n", "assoc_org | \n", "per_auto | \n", "2012GDP | \n", "hprop | \n", "hfisc | \n", "hbiz | \n", "hlab | \n", "htra | \n", "hinv | \n", "
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | \n", "Albania | \n", "3.2 | \n", "42.60 | \n", "8 | \n", "10 | \n", "13 | \n", "8 | \n", "9 | \n", "1.264810e+10 | \n", "30 | \n", "92.6 | \n", "81.0 | \n", "49.0 | \n", "79.8 | \n", "65 | \n", "
1 | \n", "Argentina | \n", "41.1 | \n", "51.94 | \n", "11 | \n", "15 | \n", "14 | \n", "11 | \n", "13 | \n", "4.755020e+11 | \n", "15 | \n", "64.3 | \n", "60.1 | \n", "47.4 | \n", "67.6 | \n", "40 | \n", "
2 | \n", "Australia | \n", "22.7 | \n", "73.28 | \n", "12 | \n", "15 | \n", "16 | \n", "12 | \n", "15 | \n", "1.532410e+12 | \n", "90 | \n", "66.4 | \n", "95.5 | \n", "83.5 | \n", "86.2 | \n", "80 | \n", "
3 | \n", "Austria | \n", "8.4 | \n", "73.15 | \n", "12 | \n", "15 | \n", "16 | \n", "12 | \n", "15 | \n", "3.947080e+11 | \n", "90 | \n", "51.1 | \n", "73.6 | \n", "80.4 | \n", "86.8 | \n", "85 | \n", "
4 | \n", "Bangladesh | \n", "154.7 | \n", "31.57 | \n", "9 | \n", "11 | \n", "9 | \n", "8 | \n", "9 | \n", "1.163550e+11 | \n", "20 | \n", "72.7 | \n", "68.0 | \n", "51.9 | \n", "54.0 | \n", "55 | \n", "