Classification CoverType.py

# coding: utf-8

# ## Problem Statement
#
# In this programming assignment, your task is to classify geographical
# locations according to their predicted tree cover using Gradient Boosting
# and Random Forest classifiers. You are expected to fill in functions that
# would complete this task. All of the necessary helper code is included in
# this notebook. However, we advise you to go over the slides, lecture
# material, the EdX videos and the corresponding notebooks before you attempt
# this Programming Assignment.
#
# You can find information about the dataset to be used in the following links:
#
# * **Dataset:**
# * **Dataset description:** -databases/covtype/covtype.info

# ## Notebook Setup

# In[1]:

# To time the entire solution
import time
start_nb = time.time()

# In[2]:

import os
os.environ["PYSPARK_PYTHON"] = "python3"
os.environ["PYSPARK_DRIVER_PYTHON"] = "python3"

from pyspark import SparkContext
sc = SparkContext()

# In[3]:

from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import GradientBoostedTrees, GradientBoostedTreesModel
from pyspark.mllib.tree import RandomForest, RandomForestModel
from pyspark.mllib.util import MLUtils

# import os
import pickle
from os.path import exists

get_ipython().magic('config IPCompleter.greedy=True')

# In[4]:

# Define a dictionary of cover types
CoverTypes = {1.0: 'Spruce/Fir',
              2.0: 'Lodgepole Pine',
              3.0: 'Ponderosa Pine',
              4.0: 'Cottonwood/Willow',
              5.0: 'Aspen',
              6.0: 'Douglas-fir',
              7.0: 'Krummholz'}
print('Tree Cover Types:', CoverTypes)

# ## Collecting Data

# In[5]:

# Break up features that are made out of several binary features.
def get_columns(cols_txt):
    cols = [a.strip() for a in cols_txt.split(',')]
    colDict = {a: [a] for a in cols}
    # The keys below must match the grouped names in cols_txt exactly.
    colDict['Soil_Type (40 binary columns)'] = ['ST_' + str(i) for i in range(40)]
    colDict['Wilderness_Area (4 binary columns)'] = ['WA_' + str(i) for i in range(4)]
    columns = []
    for item in cols:
        columns = columns + colDict[item]
    return columns

#print(columns)

# In[6]:

# Define the feature names
cols_txt = """
Elevation, Aspect, Slope, Horizontal_Distance_To_Hydrology,
Vertical_Distance_To_Hydrology, Horizontal_Distance_To_Roadways,
Hillshade_9am, Hillshade_Noon, Hillshade_3pm,
Horizontal_Distance_To_Fire_Points, Wilderness_Area (4 binary columns),
Soil_Type (40 binary columns), Cover_Type"""

columns = get_columns(cols_txt)

# In[7]:

# Read the file into an RDD
# When using sc.textFile you need to use an absolute path.
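
The column expansion done by `get_columns` can be checked on its own, without a Spark context. The sketch below restates the helper in plain Python and prints the size and layout of the result. It assumes the grouped names are spelled `Wilderness_Area (4 binary columns)` and `Soil_Type (40 binary columns)` in both the dictionary keys and `cols_txt`; the preview text garbles the spacing, so that spelling is a reconstruction rather than a confirmed detail of the original notebook.

```python
# Stand-alone restatement of the get_columns helper (no Spark required).
def get_columns(cols_txt):
    """Expand grouped feature names into one name per binary column."""
    cols = [a.strip() for a in cols_txt.split(',')]
    col_dict = {a: [a] for a in cols}
    # Assumed spellings: these keys must match the names in cols_txt exactly,
    # otherwise the grouped name is kept as a single column.
    col_dict['Soil_Type (40 binary columns)'] = ['ST_' + str(i) for i in range(40)]
    col_dict['Wilderness_Area (4 binary columns)'] = ['WA_' + str(i) for i in range(4)]
    columns = []
    for item in cols:
        columns = columns + col_dict[item]
    return columns


cols_txt = """
Elevation, Aspect, Slope, Horizontal_Distance_To_Hydrology,
Vertical_Distance_To_Hydrology, Horizontal_Distance_To_Roadways,
Hillshade_9am, Hillshade_Noon, Hillshade_3pm,
Horizontal_Distance_To_Fire_Points, Wilderness_Area (4 binary columns),
Soil_Type (40 binary columns), Cover_Type"""

columns = get_columns(cols_txt)

print(len(columns))     # 10 numeric + 4 wilderness + 40 soil + 1 label = 55
print(columns[10:14])   # ['WA_0', 'WA_1', 'WA_2', 'WA_3']
print(columns[-1])      # Cover_Type
```

If either key is misspelled relative to `cols_txt`, the function does not fail; it silently returns fewer than 55 names, so checking the length is a cheap guard before parsing the data file.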

Term: Summer
Professor: Yoav Freund
Tags: RDD
