Using Python Web Crawling and NumPy to Build a Bad Bracket

This is my first time filling out a March Madness bracket. I don't follow basketball; I'm more of a Seahawks and Broncos fan. I had about an hour before the deadline to get my bracket together, and being a developer, I decided to play to my strengths.

Theory:

My bracket is based on three axioms:

Any team is skilled enough to defeat any other. (Upsets are possible)
You want the team with the most physical players. (Basketball is a physical game)
A player's weight is mostly muscle. (I've never seen a fat basketball player)

Scoring:

Each player was given a physical score: their weight divided by their height.

All the players' scores on a team were averaged to get a team score.

The team with the best score wins the matchup. ‘Best’ means the highest average weight-to-height ratio.

Building the Bracket:

We’re mostly a Python shop, so I used requests and BeautifulSoup to build a simple crawler that scrapes the height and weight of every player on every team in the NCAA. BeautifulSoup is *the* Python tool for extracting content, data, and pretty much anything else from a webpage.

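from bs4 import BeautifulSoup

# DOMAIN (the site's base URL) and get_url (returns a page's HTML) are
# defined elsewhere in the crawler.


def table_parser(table):
    """Turn an HTML table into a list of dicts keyed by column header."""
    headers = [header.get_text(strip=True) for header in table.find_all('th')]
    results = []
    for row in table.find_all('tr'):
        cells = row.find_all('td')
        if not cells:
            # Skip rows without data cells, such as the header row.
            continue
        row_dict = {header: cell.get_text(strip=True) for header, cell in zip(headers, cells)}
        # Keep the first link in the row, if there is one.
        link = row.find('a', href=True)
        if link:
            row_dict['url'] = link['href']
        results.append(row_dict)

    return results


def fetch_player_data(team_list):
    """Attach a 'players' list to every team that has a URL."""
    for team_dict in team_list:
        if 'url' in team_dict:
            page_url = DOMAIN + team_dict['url'] + 'players'
            player_page = get_url(page_url)
            # Name the parser explicitly so results don't vary by machine.
            soup = BeautifulSoup(player_page, 'html.parser')
            table = soup.find('table', {'class': 'tablesaw'})
            if table is None:
                # No roster table on this page; leave the team without players.
                continue
            team_dict['players'] = table_parser(table)

    return team_list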

Then I used NumPy to give each player a score and to average those scores into a team score. NumPy is a very powerful scientific computing library. It's totally overkill for this project, but why not? There’s an office trophy at stake!
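
The scoring step might look roughly like the sketch below. It assumes heights come back from the scraper as strings like '6-5' (feet-inches), weights as pounds, and that the table columns are called 'Height' and 'Weight'. The parse_height and team_score helpers and the sample roster are made up for illustration.

```python
import numpy as np


def parse_height(height):
    """Convert a 'feet-inches' string such as '6-5' to inches (assumed format)."""
    feet, inches = height.split('-')
    return int(feet) * 12 + int(inches)


def team_score(players):
    """Average weight-to-height ratio (pounds per inch) across a roster."""
    # 'Weight' and 'Height' are assumed column names, not confirmed ones.
    weights = np.array([float(p['Weight']) for p in players])
    heights = np.array([parse_height(p['Height']) for p in players])
    # Element-wise division gives one ratio per player; mean gives the team score.
    return float(np.mean(weights / heights))


# Made-up roster, shaped like the dicts table_parser returns.
roster = [
    {'Name': 'Player A', 'Height': '6-0', 'Weight': '180'},
    {'Name': 'Player B', 'Height': '6-8', 'Weight': '240'},
]
print(team_score(roster))  # (180/72 + 240/80) / 2 = 2.75
```

A plain Python loop would do the same job for a few hundred players; the array version just keeps the ratio and the average on one line.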

Now I had a score for each team, keyed by the full team name, but then I ran into a problem.

Data Filling:

The ESPN bracket page shows only team abbreviations and logos. I didn’t know enough to go from the abbreviations to the full team names. (Did I mention I don’t follow basketball?)

Luckily for me, ESPN put the logo next to each team. I scraped the bracket page for each logo/abbreviation combo, ran each logo through Google reverse image search, and used its ‘best guess’ to find the full team name.

This let me build the bracket based on the scores.
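
Here is a rough sketch of that last step, assuming the field is a list of abbreviations in bracket order (adjacent teams meet first) and its length is a power of two. The play_bracket helper, the abbreviations, the team names, and the scores are all invented for illustration.

```python
def play_bracket(field, names, scores):
    """Advance the higher-scoring team each round until one team is left.

    field  -- abbreviations in bracket order (adjacent teams meet first)
    names  -- abbreviation -> full team name (from the logo lookup)
    scores -- full team name -> average weight-to-height ratio
    Returns a list of rounds, each a list of the abbreviations still alive.
    """
    rounds = [list(field)]
    while len(rounds[-1]) > 1:
        current = rounds[-1]
        # Pair up neighbours and keep whichever has the higher team score.
        # On a tie, max() keeps the first team of the pair.
        winners = [
            max(pair, key=lambda abbr: scores[names[abbr]])
            for pair in zip(current[::2], current[1::2])
        ]
        rounds.append(winners)
    return rounds


# Invented four-team field; none of these are real teams or scores.
names = {'AAA': 'Alpha State', 'BBB': 'Beta Tech',
         'CCC': 'Gamma U', 'DDD': 'Delta College'}
scores = {'Alpha State': 2.71, 'Beta Tech': 2.80,
          'Gamma U': 2.65, 'Delta College': 2.60}

for round_teams in play_bracket(['AAA', 'BBB', 'CCC', 'DDD'], names, scores):
    print(round_teams)
```

Going through the abbreviation-to-name mapping on every comparison means a bad ‘best guess’ shows up right away as a KeyError instead of a silently wrong pick.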