Using Python Web Crawling and NumPy to Build a Bad Bracket

John Rush Mar 29, 2016

This is my first time doing a March Madness bracket. I don't follow basketball; I'm more of a Seahawks & Broncos fan. I had about an hour before the deadline to get my bracket together, and being a developer, I decided to play to my strengths.

 

Theory:

My bracket is based on 3 axioms:

  • Any team is skilled enough to defeat any other. (Upsets are possible)
  • You want the team with the most physical players. (Basketball is a physical game)
  • A player's weight is mostly muscle. (I've never seen a fat basketball player)

 

Scoring:

Each player was given a physical score: their weight divided by their height.

All the players' scores were averaged to get a team score.

The team with the best score wins the matchup. ‘Best’ means the highest average weight-to-height ratio.

 

Building the Bracket:

We’re mostly a Python shop, so I used requests and BeautifulSoup to build a simple crawler that scrapes the height and weight of every player on every NCAA team. BeautifulSoup is *the* Python tool for extracting content, data, and pretty much anything else from a web page.

 

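import requests
from bs4 import BeautifulSoup

# Placeholder only: the post does not name the site that was crawled.
DOMAIN = 'https://example.com'


def get_url(url):
    """Fetch a page and return its HTML (this helper is not shown in the post)."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def table_parser(table):
    """Turn an HTML table into a list of dicts keyed by column header."""
    headers = [header.get_text(strip=True) for header in table.find_all('th')]
    results = []
    for row in table.find_all('tr'):
        cells = row.find_all('td')
        if not cells:
            continue  # skip the header row, which has no <td> cells
        row_dict = {header: cell.get_text(strip=True)
                    for header, cell in zip(headers, cells)}
        # Keep the first link in the row (the team or player page)
        link = row.find('a', href=True)
        if link:
            row_dict['url'] = link['href']
        results.append(row_dict)

    return results


def fetch_player_data(team_list):
    """Attach a 'players' list to every team dict that has a 'url'."""
    for team_dict in team_list:
        if 'url' in team_dict:
            page_url = DOMAIN + team_dict['url'] + 'players'
            soup = BeautifulSoup(get_url(page_url), 'html.parser')
            table = soup.find('table', {'class': 'tablesaw'})
            if table is None:
                continue  # no roster table found on this page
            team_dict['players'] = table_parser(table)

    return team_list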

 

Then I used NumPy to give each player a score and to average the players' scores into a team score. NumPy is a very powerful scientific computing library. It's totally overkill for this project, but why not? There’s an office trophy at stake!
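The scoring step might look something like the sketch below. It is not the original script: the 'Height' and 'Weight' keys, the 'feet-inches' height format, and the roster itself are assumptions made up for illustration.

```python
import numpy as np


def height_to_inches(height):
    """Convert a 'feet-inches' string such as '6-5' to inches (assumed format)."""
    feet, inches = height.split('-')
    return int(feet) * 12 + int(inches)


def team_score(players):
    """Average weight-to-height ratio (pounds per inch) across a team's players."""
    weights = np.array([float(p['Weight']) for p in players])
    heights = np.array([height_to_inches(p['Height']) for p in players])
    # Element-wise division gives one score per player; the mean is the team score.
    return float(np.mean(weights / heights))


# Made-up roster, for illustration only.
roster = [
    {'Height': '6-0', 'Weight': '180'},
    {'Height': '6-8', 'Weight': '240'},
]
score = team_score(roster)
print(score)
```

Real roster pages tend to have missing or oddly formatted values, so a production version would need to skip or clean those rows before dividing.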

Now I had a score for each team, keyed by the full team name, but then I ran into a problem.

 

Data Filling:

The ESPN bracket page shows only an abbreviation and a logo for each team. I didn’t know enough to go from the abbreviations to the full team names. (Did I mention I don’t follow basketball?)

Luckily for me, ESPN put the logo next to each team. I scraped the bracket page for each logo/abbreviation pair, then ran each logo through Google reverse image search and took its ‘best guess’ as the full team name.

This let me build the bracket based on the scores.
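Filling in the bracket from the scores can be sketched as a simple single-elimination loop. The team names and scores here are invented, and the function names, the power-of-two team count, and the tie-breaking rule (first-listed team advances) are assumptions rather than details from the original script.

```python
def pick_winner(team_a, team_b, scores):
    """Return the team with the higher score; ties go to the first-listed team."""
    return team_a if scores[team_a] >= scores[team_b] else team_b


def fill_bracket(teams, scores):
    """Play out a single-elimination bracket and return the winners of each round.

    `teams` is in bracket order, so adjacent entries face each other.
    Assumes the number of teams is a power of two.
    """
    rounds = []
    current = list(teams)
    while len(current) > 1:
        current = [pick_winner(current[i], current[i + 1], scores)
                   for i in range(0, len(current), 2)]
        rounds.append(current)
    return rounds


# Invented teams and scores, for illustration only.
scores = {'A': 2.71, 'B': 2.65, 'C': 2.80, 'D': 2.58}
rounds = fill_bracket(['A', 'B', 'C', 'D'], scores)
print(rounds)
```

Because the score never changes between rounds, the team with the highest score in the field always ends up as champion under this scheme.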

[Images: Team_Logo.png, best_guess.png]

 

Results:

Bottom of the BrandVerity group, and in ESPN’s 6.6th percentile. Very bad.

[Image: bracket_standings.png]

Analysis:

The scoring method I used has a side effect - players who are shorter and weigh more score higher than taller, skinnier players.

Turns out shorter, thicker teams do poorly in basketball. Who knew?
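A quick bit of arithmetic shows the bias. The two players below are hypothetical, with heights in inches and weights in pounds chosen only to illustrate the effect.

```python
def physical_score(weight_lb, height_in):
    """Weight-to-height ratio in pounds per inch."""
    return weight_lb / height_in


# Hypothetical players, not real data.
stocky_guard = physical_score(200, 70)   # 5'10", 200 lb
tall_center = physical_score(220, 82)    # 6'10", 220 lb

print(round(stocky_guard, 2), round(tall_center, 2))
```

The guard scores about 2.86 and the center about 2.68, so the metric prefers the player who is a foot shorter, even though the center is heavier.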

 

We’re hiring!

If web crawling, big data, and machine learning sound appealing, get in touch! We are looking for passionate people with diverse backgrounds.

 
