COMP 330 Assignment #5
1 Description
In this assignment, you will be implementing a regularized, logistic regression to classify text documents. The implementation will be in Python, on top of Spark. To handle the large data set that we will be
giving you, it is necessary to use Amazon AWS.
You will be asked to perform three subtasks: (1) data preparation, (2) learning (which will be done via
gradient descent) and (3) evaluation of the learned model.
Note: It is important to complete HW 5 and Lab 5 before you really get going on this assignment. HW
5 will give you an opportunity to try out gradient descent for learning a model, and Lab 5 will give you
some experience with writing efficient NumPy code, both of which will be important for making your A5
experience less challenging!
2 Data
You will be dealing with a data set that consists of around 170,000 text documents and a test/evaluation
data set that consists of 18,700 text documents. All but around 6,000 of these text documents are Wikipedia
pages; the remaining documents are descriptions of Australian court cases and rulings. At the highest level,
your task is to build a classifier that can automatically figure out whether a text document is an Australian
court case.
We have prepared three data sets for your use.
1. The Training Data Set (1.9 GB of text). This is the set you will use to train your logistic regression
model:
https://s3.amazonaws.com/chrisjermainebucket/comp330 A5/TrainingDataOneLinePerDoc.txt
or as direct S3 address, so you can use it in a Spark job:
s3://chrisjermainebucket/comp330 A5/TrainingDataOneLinePerDoc.txt
2. The Testing Data Set (200 MB of text). This is the set you will use to evaluate your model:
https://s3.amazonaws.com/chrisjermainebucket/comp330 A5/TestingDataOneLinePerDoc.txt
or as direct S3 address, so you can use it in a Spark job:
s3://chrisjermainebucket/comp330 A5/TestingDataOneLinePerDoc.txt
3. The Small Data Set (37.5 MB of text). This is for you to use for training and testing of your model on
a smaller data set:
https://s3.amazonaws.com/chrisjermainebucket/comp330 A5/SmallTrainingDataOneLinePerDoc.txt
Some Data Details to Be Aware Of. You should download and look at the SmallTrainingDataOneLinePerDoc.txt
file before you begin. You’ll see that the contents are sort of a pseudo-XML, where each text document
begins with a <doc id = ... > tag, and ends with </doc>. All documents are contained on a single
line of text.
Note that all of the Australian legal cases begin with something like <doc id = "AU1222" ...>;
that is, the doc id for an Australian legal case always starts with AU. You will be trying to figure out if the
document is an Australian legal case by looking only at the contents of the document.
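For orientation, here is a minimal sketch of how one line of the input might be turned into a (doc id, label) pair. The helper name and the exact quoting of the id attribute are assumptions on my part, so check the raw file before relying on it.

    import re

    # Hypothetical helper: pull the doc id out of one line of the pseudo-XML input
    # and derive the label (1 = Australian court case, 0 = everything else).
    ID_REGEX = re.compile(r'id\s*=\s*\W*(\w+)')

    def id_and_label(line):
        doc_id = ID_REGEX.search(line).group(1)
        return doc_id, 1 if doc_id.startswith('AU') else 0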
3 The Tasks
There are three separate tasks that you need to complete to finish the assignment. As usual, it makes
sense to implement these and run them on the small data set before moving to the larger one.
3.1 Task 1
First, you need to write Spark code that builds a dictionary that includes the 20,000 most frequent words
in the training corpus. This dictionary is essentially an RDD that has the word as the key, and the relative
frequency position of the word as the value. For example, the value is zero for the most frequent word, and
19,999 for the least frequent word in the dictionary.
To get credit for this task, give us the frequency position of the words “applicant”, “and”, “attack”,
“protein”, and “car”. These should be values from 0 to 19,999, or -1 if the word is not in the dictionary,
because it is not among the top 20,000.
Note that accomplishing this will require you to use a variant of your A4 solution. If you do not trust
your A4 solution and would like mine, you can post a private request on Piazza.
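For orientation only, here is one possible shape of the Task 1 pipeline in PySpark, assuming a crude letters-only tokenization; the tokenizer, variable names, and use of the small data set are illustrative choices, not a prescribed solution.

    import re
    from pyspark import SparkContext

    sc = SparkContext(appName="A5Task1")
    corpus = sc.textFile("s3://chrisjermainebucket/comp330 A5/SmallTrainingDataOneLinePerDoc.txt")

    # Crude tokenization: keep letters only, lower-case, split on whitespace.
    strip_regex = re.compile('[^a-zA-Z]')
    words = corpus.flatMap(lambda line: strip_regex.sub(' ', line).lower().split())

    # Count every word and keep the 20,000 most frequent ones.
    counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
    top_words = counts.top(20000, key=lambda pair: pair[1])  # list of (word, count)

    # Dictionary RDD: word -> frequency position (0 = most frequent, 19,999 = least).
    dictionary = sc.parallelize(range(len(top_words))).map(lambda i: (top_words[i][0], i))

    # Frequency positions for the five required words; -1 if a word is missing.
    lookup = dict(dictionary.collect())
    for w in ["applicant", "and", "attack", "protein", "car"]:
        print(w, lookup.get(w, -1))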
3.2 Task 2
Next, you will convert each of the documents in the training set to a TF-IDF vector. You will then use
a gradient descent algorithm to learn a logistic regression model that can decide whether a document is
describing an Australian court case or not. Your model should use L2 regularization; you can experiment with
things a bit to determine the parameter controlling the extent of the regularization. We will have enough
data that you might find that the regularization may not be too important (that is, it may be that you get good
results with a very small weight given to the regularization constant).
I am going to ask that you not just look up the gradient descent algorithm on the Internet and implement
it. Start with the LLH function from class, and then derive your own gradient descent algorithm. We can
help with this if you get stuck.
At the end of each iteration, compute the LLH of your model. You should run your gradient descent
until the change in LLH across iterations is very small.
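To make the update and the stopping rule concrete, here is a minimal NumPy sketch, assuming labels y in {0, 1}, the standard L2-regularized LLH from class, and a feature matrix small enough to hold locally; on the real data the per-document sums would be computed with Spark aggregations, and you should still derive the gradient yourself rather than treat this as the required formula.

    import numpy as np

    def llh_and_gradient(x, y, theta, lam):
        """x: (n, d) TF-IDF matrix, y: (n,) labels in {0, 1},
        theta: (d,) coefficients, lam: L2 regularization weight."""
        scores = x.dot(theta)
        # log(1 + exp(s)) computed stably with logaddexp to avoid overflow
        llh = np.sum(y * scores - np.logaddexp(0.0, scores)) - lam * np.sum(theta ** 2)
        grad = x.T.dot(y - 1.0 / (1.0 + np.exp(-scores))) - 2.0 * lam * theta
        return llh, grad

    # Tiny synthetic example so the loop below runs as-is; swap in your real data.
    rng = np.random.RandomState(0)
    x = rng.randn(200, 5)
    y = (x[:, 0] > 0).astype(float)
    lam, rate = 0.1, 0.01  # regularization weight and a fixed learning rate

    theta, old_llh = np.zeros(x.shape[1]), -np.inf
    while True:
        llh, grad = llh_and_gradient(x, y, theta, lam)   # LLH at the end of each pass
        if abs(llh - old_llh) < 1e-6:                    # stop once the LLH barely changes
            break
        theta, old_llh = theta + rate * grad, llh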
Once you have completed this task, you will get credit by (a) writing up your gradient update formula,
and (b) giving us the fifty words with the largest regression coefficients. That is, those fifty words that are
most strongly associated with an Australian court case.
3.3 Task 3
Now that you have trained your model, it is time to evaluate it. Here, you will use your model to predict
whether or not each of the testing points corresponds to an Australian court case. To get credit for this task,
you need to compute for us the F1 score obtained by your classifier—we will use the F1 score obtained as
one of the ways in which we grade your Task 3 submission.
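In case it is useful, here is a small sketch of the F1 computation from 0/1 label and prediction arrays, assuming the positive class (label 1) means "Australian court case"; the function name is just illustrative.

    import numpy as np

    def f1_score(y_true, y_pred):
        """F1 of the positive class, given two 0/1 NumPy arrays of equal length."""
        tp = np.sum((y_pred == 1) & (y_true == 1))
        fp = np.sum((y_pred == 1) & (y_true == 0))
        fn = np.sum((y_pred == 0) & (y_true == 1))
        precision = tp / (tp + fp) if tp + fp > 0 else 0.0
        recall = tp / (tp + fn) if tp + fn > 0 else 0.0
        return 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0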
Also, I am going to ask you to actually look at the text for three of the false positives that your model
produced (that is, Wikipedia articles that your model thought were Australian court cases). Write a paragraph
describing why you think your model was fooled. Were the bad documents about Australia? The
legal system?
If you don’t have three false positives, just use the ones that you had (if any).
4 Important Considerations
Some notes regarding training and implementation. As you implement and evaluate your gradient descent algorithm, here are a few things to keep in mind.
1. To get good accuracy, you will need to center and normalize your data: transform your data so
that the mean of each dimension is zero and the standard deviation is one. That is, subtract the mean
vector from each data point, and then divide the result by the vector of standard deviations computed
over the data set (see the sketch after this list).
2. When classifying new data, a data point whose dot product with the set of regression coefs is positive
is a “yes”, and a negative one is a “no” (see slide 15 in the GLM lecture). You will be trying to maximize the
F1 of your classifier, and you can often increase the F1 by choosing a cutoff between “yes”
and “no” other than zero. Another thing that you can do is to add another dimension whose value is
one in each data point (we discussed this in class). The learning process will then choose a regression
coef for this special dimension that tends to balance the “yes” and “no” nicely at a cutoff of zero.
However, some students in the past have reported that this can increase the training time.
3. Students sometimes face overflow problems, both when computing the LLH and when computing the
gradient update. Some things that you can do to avoid this are (1) using np.exp(), which seems to
be quite robust, and (2) transform your data so that the standard deviation is smaller than one—if you
have problems with a standard deviation of one, you might try 10^-2 or even 10^-5. You may need to
experiment a bit. Such are the wonderful aspects of implementing data science algorithms in the real
world!
4. If you find that your training takes more than a few hours to run to convergence on the largest data set,
it likely means that you are doing something that is inherently slow that you can speed up by looking
at your code carefully. One thing: there is no problem with first training your model on a small sample
of the large data set (say, 10% of the documents) and then using the result as an initialization to continue
training on the full data set. This can speed up the process of reaching convergence.
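Tying points 1 to 3 together, here is a rough sketch for a locally held sample, assuming tfidf is an (n, d) NumPy array of TF-IDF vectors; on the full data set the mean and standard deviation would come from Spark aggregations, and the target scale of 10^-2 is just one value to try.

    import numpy as np

    rng = np.random.RandomState(0)
    tfidf = rng.rand(10, 4)       # stand-in for your real (n, d) TF-IDF matrix

    mean = tfidf.mean(axis=0)
    std = tfidf.std(axis=0)
    std[std == 0] = 1.0           # avoid dividing by zero on constant dimensions
    scale = 1e-2                  # shrink further (e.g., 1e-5) if you still overflow
    normalized = (tfidf - mean) / std * scale

    # Point 2: append a constant dimension of one so a cutoff of zero works well.
    normalized = np.hstack([normalized, np.ones((normalized.shape[0], 1))])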
Big data, small data, and grading. The first two tasks are worth three points, the last four points. Since it
can be challenging to run everything on a large data set, we'll offer you a small data option. If you train
your model on TestingDataOneLinePerDoc.txt, and then test it on SmallTrainingDataOneLinePerDoc.txt,
we'll take off 0.5 points on Task 2 and 0.5 points on Task 3. This means you can still get an A, and
you don't have to deal with the big data set. To be eligible for full credit, train your model on the quite
large TrainingDataOneLinePerDoc.txt data set, and then test it on TestingDataOneLinePerDoc.txt.
4.1 Machines to Use
If you decide to try for full credit on the big data set, you will need to run your Spark jobs on three to five
machines as workers, each having around 8 cores. If you are not trying for the full credit, you can likely
get away with running on a smaller cluster. Remember, the costs WILL ADD UP QUICKLY IF YOU
FORGET TO SHUT OFF YOUR MACHINES. Be very careful, and shut down your cluster as soon as
you are done working. You can always create a new one easily when you begin your work again.
4.2 Turnin
Create a single document that has results for all three tasks. Make sure to be very clear whether you
tried the big data or small data option. Turn in this document as well as all of your code. Please zip up all
of your code and your document (use .gz or .zip only, please!), or else attach each piece of code as well as
your document to your submission individually. Do NOT turn in anything other than your Python code and
your document.