Posts about intro

Inverted Index Project

2016-11-26T16:52:56-05:00

I haven't spoken much about the class I've been teaching this semester. It's an intro CS course - a programming-heavy intro. I decided to use Python with a transition at the end to C++. The transition mirrors Hunter's normal first CS course, which ends with a C++ intro to prepare the students for next semester's CS course: a more intense OOP class using C++ - the language we use in our core courses.

Throughout the semester I've tried to use a variety of interesting application areas to give the students some idea of the possibilities that studying CS will open up for them.

After covering Python dictionaries and lists I thought we'd play by building an inverted index.

The basic idea is to map a set of words back to source files. For example, given the following four one-line files:

files			contents
file.01			if you prick us do we not bleed
file.02			if you tickle us do we not laugh
file.03			if you poison us do we not die and
file.04			if you wrong us shall we not revenge

You could build a data structure mapping each word back to the file(s) that contain it (partially shown here),

Word		Files containing it
if		file.01 file.02 file.03 file.04
you		file.01 file.02 file.03 file.04
prick		file.01
us		file.01 file.02 file.03 file.04
do		file.01 file.02 file.03

You can, of course, store more information - how many times a word appears in a file, where it appears, etc.

This is a fairly easy structure to build: a dictionary where the keys are the words in the files and the values are lists of the documents containing each word.

  inverted_index = {
      'if' : ['file.01','file.02','file.03','file.04'],
      'you' : ['file.01','file.02','file.03','file.04'],
      'prick' : ['file.01'],
      'us' : ['file.01','file.02','file.03','file.04'],
      'do' : ['file.01','file.02','file.03'],
      ...
  }

In addition to letting us work with dictionaries and lists, we can also review file access and even the Python csv module if we want.
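Here is a minimal sketch of how such a dictionary might be built. The function name `build_inverted_index` and the in-memory `documents` dictionary are assumptions made for illustration; a classroom version would more likely read each file from disk.

```python
def build_inverted_index(documents):
    """Map each word to the list of document names that contain it.

    `documents` is a dictionary of {name: text}, a hypothetical stand-in
    for opening and reading each file.
    """
    index = {}
    for name, text in documents.items():
        for word in text.lower().split():
            names = index.setdefault(word, [])
            if name not in names:  # record each document once per word
                names.append(name)
    return index


documents = {
    'file.01': 'if you prick us do we not bleed',
    'file.02': 'if you tickle us do we not laugh',
    'file.03': 'if you poison us do we not die and',
    'file.04': 'if you wrong us shall we not revenge',
}

inverted_index = build_inverted_index(documents)
print(inverted_index['prick'])  # ['file.01']
print(inverted_index['do'])     # ['file.01', 'file.02', 'file.03']
```

Swapping the `documents` dictionary for a loop that opens each file is where the file access review would come in.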

We can immediately write simple queries, such as "what document(s) contain the word 'prick'?", but things get more interesting if you write functions to perform AND and OR queries - "what document(s) contain the words 'prick' or 'do'?" for instance.
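One possible way to write those queries, assuming an index shaped like the dictionary above, is to turn the document lists into sets. The names `or_query` and `and_query` are made up for this sketch.

```python
def or_query(index, words):
    """Return the documents containing at least one of the words."""
    result = set()
    for word in words:
        result |= set(index.get(word, []))
    return sorted(result)


def and_query(index, words):
    """Return the documents containing every one of the words."""
    if not words:
        return []
    result = set(index.get(words[0], []))
    for word in words[1:]:
        result &= set(index.get(word, []))
    return sorted(result)


# A slice of the index from the four-file example.
index = {
    'prick': ['file.01'],
    'do': ['file.01', 'file.02', 'file.03'],
    'laugh': ['file.02'],
}

print(or_query(index, ['prick', 'do']))      # ['file.01', 'file.02', 'file.03']
print(and_query(index, ['prick', 'do']))     # ['file.01']
print(and_query(index, ['prick', 'laugh']))  # []
```

Sets are a convenience here; the same queries can be written with nested loops over the lists if sets haven't been covered yet.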

Why are we building this (besides as a data structure and programming exercise)? I've seen a number of references to using an inverted index when building a web search engine. In fact, I think that's something you do early on in the Udacity MOOC. I just wanted to play with information retrieval.

I remembered that there was a collection of information, including last statements from executed offenders in Texas. Someone conveniently converted it into a Google Spreadsheet. The format's a little different from our simple four-file example, but then there's more data. It's straightforward enough to download the spreadsheet as a CSV file and then read it with a Python program that builds it into an inverted index.
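A rough sketch of the CSV step might look like the following. The column names `id` and `statement` and the sample rows are placeholders, not the real spreadsheet's headers or contents, and `index_csv` is a hypothetical helper.

```python
import csv
import io
import string


def index_csv(csv_file, id_column, text_column):
    """Build {word: [row ids]} from one text column of a CSV file."""
    index = {}
    for row in csv.DictReader(csv_file):
        for raw in row[text_column].lower().split():
            word = raw.strip(string.punctuation)  # drop commas, periods, etc.
            if not word:
                continue
            ids = index.setdefault(word, [])
            if row[id_column] not in ids:
                ids.append(row[id_column])
    return index


# Placeholder data standing in for the downloaded spreadsheet; the
# column names 'id' and 'statement' are assumptions for this sketch.
sample = io.StringIO(
    'id,statement\n'
    '1,"I am sorry, truly sorry."\n'
    '2,"I want to apologize to my family."\n'
    '3,"I am sorry. I apologize."\n'
)

index = index_csv(sample, 'id', 'statement')
print(index['sorry'])      # ['1', '3']
print(index['apologize'])  # ['2', '3']
```

With a real download, the `io.StringIO` object would be replaced by something like `open('statements.csv', newline='')`, where the file name is again only an example.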

Now we have some interesting data to play with.

How many offenders used words like "sorry" or "apologize"? How about references to religion? We can do all sorts of AND and OR queries.

We just played with this a bit but I could see all sorts of explorations. What about taking some great work of literature and turning it into an inverted index by chapter? You could query characters or certain words and see where and when they appear in the book. A new and different way of exploring literature.

So, there you have it - an interesting little project we played with this past semester. We did it in an intro Python course but I could see it as an interesting project in AP CS A using hashmaps and lists.

Madlib Madness

2013-04-30T00:00:00-04:00

Earlier in the term, our intro classes spent a little time learning some basic HTML. We don't spend a lot of time on it, just enough so that the students can present their work in a static web site. The end goal, though, was to programmatically generate the web sites - there's nothing quite as empowering to a student as when they can present their work to the world.

Finally, it's all coming together.

Now that the classes are comfortable with Python, we can have some fun. We all remember Mad Libs - that wacky word game where you unknowingly select words to substitute into a basic story and hilarity ensues.

We did our own versions using Python files, lists and dictionaries.

Here are some of the results:

1. http://homer.stuy.edu/~richard.zhan/19-Madlibs.py
2. http://homer.stuy.edu/~veronika.azzara/madlibifystory.py
3. http://homer.stuy.edu/~belinda.liang/18-MadLibsMiniProject.py
4. http://homer.stuy.edu/~kyle.oleksiuk/MadlibifyProject5.py
5. http://homer.stuy.edu/~phillip.huynh/story.py

The students wrote a basic story with substitution points. Their programs then randomly replaced these points with words from an assortment of categorized lists.
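For readers who want to try it, here is one way the idea could be sketched. The `<category>` marker syntax, the word lists, and the name `madlibify` are all assumptions for illustration; the students' programs used their own stories and formats.

```python
import random
import re

# Hypothetical word lists; the students' categories and words differed.
WORDS = {
    'noun': ['turtle', 'sandwich', 'robot'],
    'verb': ['juggled', 'painted', 'tickled'],
    'adjective': ['sparkly', 'grumpy', 'enormous'],
}

# Substitution points are written as <category>; this marker syntax is
# an assumption made for the sketch.
STORY = 'The <adjective> <noun> <verb> a <adjective> <noun>.'


def madlibify(story, words):
    """Replace each <category> marker with a random word from that list."""
    def pick(match):
        return random.choice(words[match.group(1)])
    return re.sub(r'<(\w+)>', pick, story)


print(madlibify(STORY, WORDS))
```

Each run prints a different sentence, since every marker gets its own random pick.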

Enjoy!!!!!

Who won the election -- Quadratic to Linear Time!!!!!

2013-03-23T00:00:00-04:00

Last week was crazy. Busy, stressful, late night after late night. It ended, though, on a great note.

A young lady in my intro class found me in my office near the end of the day:

Student: Mr. Z, I wanted to make sure to catch you before vacation!

Me: What's up?

Student: I wanted to tell you that today's lesson was AWESOME!!!!!!

Wow. I've been teaching 23 years and that's never happened before!!!!

So, what was the hubbub about?

We've been doing list processing in Python over the past few days. We already did the basics, such as finding the largest element in a list:

  def find_max(L):
      maxval = L[0]
      i = 0
      while i < len(L):
          if L[i] > maxval:
              maxval = L[i]
          i += 1
      return maxval

We've also done basic searching, counting elements, removing elements, etc.

Today we started with finding the mode of a list of grades.

Most students approached the problem as a maximum problem. Assume the first item is the mode and find its frequency, then proceed through the list, each time seeing if the current item occurs more frequently than the "mode so far." Pretty much the same idea as find_max (but in this case, returning a list of all the modes).

  def mode(L):
      modecount = L.count(L[0])
      modes = [L[0]]
      i = 1
      while i < len(L):
          c = L.count(L[i])
          if c > modecount:
              modecount = c
              modes = [L[i]]
          elif c == modecount and L[i] not in modes:
              modes.append(L[i])
          i += 1
      return modes

Pretty cool. The kids are doing something pretty sophisticated here.

Time to look deeper. We started running this on larger and larger data sets. Things started really slowing down at about 20K. We then timed things to get some numbers (thanks StackOverflow).
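The timing code we used in class isn't shown, so what follows is only a sketch of one way to get such numbers, using `time.perf_counter`. The helper `time_mode` and the list sizes are assumptions, and `mode` is restated with a `for` loop so the block runs on its own.

```python
import random
import time


def mode(L):
    # Same approach as the routine above: one count() call per item.
    modecount = L.count(L[0])
    modes = [L[0]]
    for item in L[1:]:
        c = L.count(item)
        if c > modecount:
            modecount = c
            modes = [item]
        elif c == modecount and item not in modes:
            modes.append(item)
    return modes


def time_mode(n):
    """Return the seconds mode() takes on n random grades from 0 to 100."""
    grades = [random.randint(0, 100) for _ in range(n)]
    start = time.perf_counter()
    mode(grades)
    return time.perf_counter() - start


for n in [1000, 2000, 4000]:
    print(n, round(time_mode(n), 4))
```

If the routine really is quadratic, each doubling of n should take roughly four times as long, though the exact numbers depend on the machine.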

What was going on? The students pretty quickly homed in on the line that called L.count(L[i]) -- Hidden Complexity.

We haven't done big-O notation but the class easily saw that count had to go through the entire data set and we ended up with an N^2 algorithm. For example, if we have 10 items, the main loop executes 10 times and each time, count goes through the entire list (10 items) as well. If we go to 100 items, it becomes 100x100.

What to do????

Time to talk about what's probably the most discussed instance of mode finding - elections. The winner is "the mode of the ballots."

Of course we don't use the above algorithm. We usually tally or count the ballots. We go through the ballots once, each time adding one to the appropriate candidate's "bucket."
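The ballot tally can be sketched directly with a dictionary of buckets keyed by candidate name. This is my illustration rather than what we wrote in class, where we used a list; the function names and ballots are made up.

```python
def tally(ballots):
    """Count the votes for each candidate in one pass over the ballots."""
    buckets = {}
    for name in ballots:
        buckets[name] = buckets.get(name, 0) + 1
    return buckets


def winners(ballots):
    """Return every candidate tied for the most votes."""
    buckets = tally(ballots)
    most = max(buckets.values())
    return [name for name in buckets if buckets[name] == most]


# Made-up ballots for illustration.
ballots = ['Ada', 'Grace', 'Ada', 'Alan', 'Grace', 'Ada']
print(tally(ballots))    # {'Ada': 3, 'Grace': 2, 'Alan': 1}
print(winners(ballots))  # ['Ada']
```

A dictionary handles names as easily as numbers, which is why it suits ballots; for small integer grades a plain list works just as well.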

From here, it's a short step to see that we can use a list. Its indices represent the grade values and the data in the list the counts or tallies:

  def fastmode(L):
      # find the largest grade once, not on every pass of the loop
      size = max(L) + 1
      i = 0
      counts = []
      while i < size:
          counts.append(0)
          i += 1
      i = 0
      while i < len(L):
          counts[L[i]] += 1
          i += 1
      modecount = max(counts)
      modes = []
      i = 0
      while i < len(counts):
          if counts[i] == modecount:
              modes.append(i)
          i = i + 1
      return modes

We go through the list once to build the tallies and then the "tally" list once to get the modes. Simple, straightforward, and linear time!!!!!!!!!

The original routine started to hit a roadblock at about 20K items; here we got to one million without breaking a sweat.

The takeaways:

Get it working first.
Then profile to find your bottleneck.
Look at the problem in a different way.
Using data structures in a clever way can really improve performance.

Layers of a lesson

2012-12-17T00:00:00-05:00

In my last post I talked about the fact that, as teachers, our knowledge and experience are frequently trivialized. The tenor of the times is that anyone can design a course, anyone can teach, and in fact, we don't even need teachers, just videos or computer based systems. If you've ever tutored a friend, you're more than qualified.

That might be a strong statement but everywhere you look you see "education" programs designed and implemented by non-teachers. The belief seems to be that teaching involves only the most superficial transfer of information.

Today, I thought I'd look at a lesson I taught the other week: how I've seen similar material presented and how my colleagues and I might treat the subject.

We use NetLogo in our sophomore-level intro course. It's a highly parallel version of Logo. It's very visual, it's great for modeling and you can introduce deep, meaningful concepts such as parallel processing in a gentle manner.

Early on the kids have to learn how to manipulate the turtles. In NetLogo you write a single program and it's run by all the turtles "at once." The image above is one of their early "experiments." Have the turtles wiggle out of the center, but when they get to an invisible border, start spinning. They do a number of variations on this theme.

A solution might look like this:

; asked in a turtle context
to gospin
  ifelse abs xcor < 8 and abs ycor < 8
    [ wiggle ] ; wiggle implementation not shown
    [ left 5 ]
end

Let's call level one just talking about the solution by looking at the program as a sequence of instructions. Specifically relating the instructions to the problem, showing how it solves it, and that's it.

This is the simplest level. A book, video, or online courseware can approach teaching at this level. A non computer scientist teacher or a non teacher computer scientist could do so as well. Students might learn a bit but I wouldn't hope for much inspiration or creativity to come out of it.

Let's move to level two.

Here we might talk about "what the turtles are doing." They're always doing something, either wiggling or spinning. This is a step in the right direction. When done right, the students start thinking about the problem in a more general sense but they're still looking at the problem as something that exists only in the world of NetLogo. They are more likely to develop patterns than in level one, but it's still limited.

Level three is where things get interesting. On the surface, the problem is just a nice introduction to programming turtles in NetLogo. At a deeper level, it's an opportunity to introduce the kids to State Machines. A new way of thinking about problems and problem solving.

Students understand the idea of a "state." For example, in class, they're in a "seated state," maybe in a "note taking state," etc. It's easy to see that they don't know what their day will bring but they constantly make decisions based on their "state." Likewise, they can think of the turtle as being in a state. It's either in a wiggling state or a spinning state and, based on its situation, it can either continue in its current state or transition to the other one:
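In code, the same two-state machine might be sketched as below. This is a Python illustration rather than NetLogo, and the names `step` and `at_border` are assumptions for the sketch.

```python
def step(state, at_border):
    """Return the turtle's next state given whether it is at the border."""
    if state == 'wiggling' and at_border:
        return 'spinning'  # transition to the other state
    return state           # otherwise continue in the current state


# Feed the machine a made-up series of observations.
state = 'wiggling'
history = [state]
for at_border in [False, False, True, True]:
    state = step(state, at_border)
    history.append(state)

print(history)
# ['wiggling', 'wiggling', 'wiggling', 'spinning', 'spinning']
```

The point of writing it this way is that the decision depends only on the current state and the current situation, not on how the turtle got there.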

This opens up a new way of thinking and it's easy to see how this extends to other problems, for example, a ghost from pacman:

A good teacher thinks about working across these levels. He adjusts based on the class and looks for opportunities to develop these deeper concepts.

Pair Programming Tag Team Shootout

2012-03-01T00:00:00-05:00

So today we changed things up a bit.

Instead of having a typical lab-type period, we tried the Pair Programming Tag Team Shootout.

We aren't annualized so while the kids that have been with me since September have been working in pairs for a while, the other half of the class is just getting used to how we do it. I also wanted to get the kids to mix a little more.

Hence the shootout.

Everyone got a sheet with a bunch of problems on it:

Shootout

I then paired them off randomly.

The idea is to complete the first problem, find a new partner, and repeat.

By the end of the period each student worked with between five and seven partners.

I'm having them send me their solutions and partners tonight.

The early response was good -- it's helping them get to know each other faster and it was a nice change of pace. We had some problems coordinating switching problems, but we'll do better next time.

All in all a good day.

Let me Google that for you

2012-02-08T00:00:00-05:00

Piloting a new course this semester - Intro to Computer Science part 2. Between the existing Intro part 1 and this, we should be able to do a pretty thorough job in preparing our kids for the future.

We decided that we wanted the kids to make deliverables in the form of web pages - plain old HTML written by hand. Part of the idea was to demystify things, part was to let the kids show off their work, part was to have something that they can generate programmatically as the course progressed, and part was to give them a tool they might find valuable beyond their computer science classes.

We also wanted to help teach the kids how to find information and how to learn things on their own. Despite the fact that our students use computers all the time, they possess a widely varying skill set. With that in mind, here's what we tried to do:

After a brief introduction to what a web page is (just a text file with markup) and showing them the bare minimum of markup:

I recommended a simple editor - gedit - while resisting all my inner urges for all things emacs, and then showed them an image of a web page:

The end goal was to make a page that had all of the elements in the above image but I also asked:

How did they go about finding out how to make the page?
Where did they search?
What turned up bad results (and what were they)?
What turned up good results (and what were they)?

I was very pleased with the results. Just about all the kids are now able to make a web page with the components in the image above. More importantly, this is what came out of our discussion:

Everyone used Google exclusively as a search engine.
Queries ranged from things like "html tutorial," "making a web page," and just plain "html" to maybe not so good things like "gedit web page."
No one used social search or used Facebook.
They mostly all found sites such as w3schools.

I'm hoping this is a good first step in having the students find things on their own and not be afraid to try things. I think it's an encouraging start.