Introduction
Social media provides a kind of virtual life in which people openly express feelings, opinions and beliefs. Many real-world events are influenced by the sentiments people voice on these platforms. Unfortunately, hateful and abusive language targets and harasses many users engaging online, whether on social media or on forums, so such language needs to be detected before any online content is published. The services provided by online social networks such as Twitter, Instagram and Facebook are used by people from different backgrounds, interests and cultures.
Communication between people of these varied backgrounds therefore increases day by day, which in turn leads to more cyber conflicts between them. According to most national and international legislation, hate speech refers to expressions that incite harm, discrimination, hostility or violence against an identified social or demographic group. Hate speech can take any form of expression, such as images, videos and songs, as well as written comments.
This research considers written comments, so hate speech detection is treated here as a text-specific (content-specific) analytics task.
Many countries, including the United Kingdom, Canada and France, have laws prohibiting hate speech, but social media services such as Facebook and Twitter do not have sufficient provisions against hate speech or attacks on a specific race. These websites are open spaces for people to discuss and share thoughts and opinions, which makes it almost impossible to control their content. Furthermore, many people tend to use aggressive and hateful language.
Ultimately, all of these factors point towards the need for automatic hate speech detection.
Many solutions for automatic hate speech detection rely on Natural Language Processing (NLP). However, these approaches have a drawback: they depend entirely on the language used in the text. This motivates machine learning techniques such as neural networks for the classification task, which mostly use pre-trained vectors (e.g. GloVe, Word2Vec) as word embeddings and achieve better results from the classification model. If a user uses short slang words for hate speech, however, these techniques fail to detect it. To overcome this problem, sentiment polarities are detected in the tweets (Barnaghi et al., 2016) to differentiate the speech. Unsupervised learning models (15) have also been used for hate speech detection, and deep learning approaches (2, 4, and 5) likewise give better results in hate speech detection.
Our approach employs a deep learning architecture for text classification: a recurrent neural network composed of Long Short-Term Memory (LSTM) based classifiers. We then show an experimental evaluation of the model against other algorithms.
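An LSTM-based classifier consumes a tweet one token embedding at a time and carries a memory cell across steps. The following NumPy sketch shows what a single LSTM cell computes per time step; it is illustrative only (the dimensions, random parameters and variable names are assumptions, not the configuration used in this research):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step over a single token embedding x.
    W, U, b stack the parameters of the input, forget and output
    gates and the candidate cell state (4 * hidden rows)."""
    hid = h_prev.shape[0]
    z = W @ x + U @ h_prev + b               # stacked pre-activations
    i = sigmoid(z[:hid])                     # input gate
    f = sigmoid(z[hid:2 * hid])              # forget gate
    o = sigmoid(z[2 * hid:3 * hid])          # output gate
    g = np.tanh(z[3 * hid:])                 # candidate cell state
    c = f * c_prev + i * g                   # updated memory cell
    h = o * np.tanh(c)                       # emitted hidden state
    return h, c

# Feed a toy tweet of 3 token embeddings through the cell; the final
# hidden state h summarises the whole sequence for the classifier.
rng = np.random.default_rng(0)
emb_dim, hid = 4, 5
W = 0.1 * rng.normal(size=(4 * hid, emb_dim))
U = 0.1 * rng.normal(size=(4 * hid, hid))
b = np.zeros(4 * hid)
h, c = np.zeros(hid), np.zeros(hid)
for x in rng.normal(size=(3, emb_dim)):
    h, c = lstm_step(x, h, c, W, U, b)
print(h.shape)
```

The gating structure is what lets the network remember or forget earlier tokens, which is why LSTMs capture sequence order better than bag-of-words features.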
Goals and Objectives
The primary aim of this research is hate speech detection on Twitter. Two classification algorithms, a Long Short-Term Memory (LSTM) network, which is a variant of the recurrent neural network (RNN), and a basic RNN, are applied together on a sample data set to evaluate the effectiveness of the proposed approach for finding hateful speech. Further, the research focuses on hateful tweets and various features of their authors, such as number of followers, number of followees, account creation date, etc., in order to predict users who persistently post hateful speech.
The scope of this dissertation includes:
1. Increasing our understanding of hate speech on social media sites like Twitter.
2. Detecting hateful tweets and categorizing them into three classes: Hateful, Offensive and Clean.
3. Analyzing the effectiveness of the proposed deep learning approach, i.e. LSTM.
4. Predicting hateful users.
Motivation
Among the variety of online social networking websites, such as Twitter, Facebook, YouTube, Instagram, LinkedIn, etc., Twitter is the one on which teenagers and adolescents are most active, which underlines the importance of detecting and removing online hate content. There is currently quite a lot of research on automatic detection of online hate speech. The methods for detecting hate speech can be divided into two categories: one based on manual feature engineering, where the features are consumed by algorithms such as SVM, Naïve Bayes and Logistic Regression, and another representing the deep learning paradigm.
In this research we use a deep learning method for automatic detection of hate speech, i.e. a combination of a Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM), a type of recurrent neural network. Intuitively, the CNN learns features similar to n-gram sequences and the LSTM learns sequence order, both of which are useful for classification. The machine learning methods, i.e. supervised and unsupervised algorithms such as Naïve Bayes, SVM and Logistic Regression (18), rely on dictionaries and may select the wrong sentences as hate speech, which decreases accuracy. The deep neural network architecture for hate speech detection (5) outperforms these machine learning methods. In this research we use ternary classification, i.e. whether a tweet is Clean, Offensive or Hateful (1), where offensive may include, for example, racism. Lastly, features of the user are also useful in hate speech detection, so various features of hateful users are observed in order to predict persistently hateful users.
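The intuition that a CNN layer acts as a set of learned n-gram detectors can be made concrete. The minimal NumPy sketch below (illustrative only; the filter count, filter width and embedding size are arbitrary assumptions, not the configuration of the proposed model) slides trigram-width filters over token embeddings and max-pools each one, exactly the operation a convolutional text layer performs before its output is passed on:

```python
import numpy as np

def conv1d_ngram_features(embeddings, filters):
    """Slide each filter (a learned n-gram detector) over the token
    embeddings and max-pool, as a CNN text layer does."""
    feats = []
    for f in filters:                        # one pooled feature per filter
        width = f.shape[0]
        scores = [float(np.sum(f * embeddings[i:i + width]))
                  for i in range(embeddings.shape[0] - width + 1)]
        feats.append(max(scores))            # max-pool over positions
    return np.array(feats)

rng = np.random.default_rng(1)
tweet = rng.normal(size=(10, 8))             # 10 tokens, 8-dim embeddings
filters = rng.normal(size=(6, 3, 8))         # 6 trigram-width filters
features = conv1d_ngram_features(tweet, filters)
print(features.shape)                        # one feature per filter
```

In the combined architecture, such convolutional feature maps (before pooling away the position axis) would feed the LSTM, which then models their order.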
Literature Survey
Introduction
Simple word-based approaches fail to identify hate speech and offensive speech, and also restrict freedom of expression and emotion. Most words and sentences can have many different meanings in different contexts, which is called the ambiguity problem; because of it, the false positive rate is high, so word-based approaches are avoided in hate speech detection. Natural Language Processing (NLP) approaches are likewise not effective at detecting user comments with unusual spellings; this is called the spelling variation problem, caused by the replacement of characters in a token. Hate speech detection has been carried out in several ways. The first is the lexicon-based approach, in which the machine uses language patterns, grammar and manually created rules; N.D. Gitari et al. (4) present a model that uses a lexicon to find hate speech. The second is the machine learning approach: Davidson et al. (1) present a multi-class classifier model for classifying tweets as hateful, offensive or clean, and Hajime Watanabe et al. (11) use n-gram features to classify tweets into the same three classes. The third is the hybrid approach, in which learning-based as well as lexicon-based methods (8) are used. Lastly, Pinkesh Badjatiya et al. (5) compare deep learning methods, i.e. FastText, CNN and LSTM, along with task-specific embeddings learned using these three methods.
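One common mitigation for the spelling variation problem described above is to work with character n-grams rather than whole tokens: an obfuscated spelling still shares some character n-grams with the original word, while an unrelated word shares none. The helper names below are hypothetical, and this is only a sketch of the idea, not part of the proposed system:

```python
def char_ngrams(token, n=3, pad="#"):
    """Character n-grams of a token, padded so that word
    boundaries also become features."""
    padded = pad + token.lower() + pad
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def jaccard(a, b):
    """Overlap between two n-gram sets."""
    return len(a & b) / len(a | b)

# An obfuscated spelling ("id1ot") still overlaps with the original
# word, while an unrelated word does not overlap at all.
print(jaccard(char_ngrams("idiot"), char_ngrams("id1ot")))
print(jaccard(char_ngrams("idiot"), char_ngrams("hello")))
```

A token-level dictionary lookup would score both comparisons as zero, which is precisely why character-level features help with obfuscated abuse.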
Previous Work
Some work has been done on this topic, especially hate speech detection on social media such as YouTube, Twitter and Facebook, covering online content as well as various data sets.
Year | Title | Authors | Description
2018 | Hate Speech Detection using Convolution-LSTM Based Deep Neural Network | Ziqi Zhang, David Robinson, Jonathan Tepper | A deep neural network combining convolutional and LSTM layers with drop-out and pooling is used for classification, which improves classification accuracy.
2018 | Detecting Offensive Language in Tweets using Deep Learning | Georgios Pitsilis, Heri Ramampiaro, Helge Langseth | Multiple LSTM-based classifiers that use behavioural characteristics and features of the user classify hate speech into two categories.
2017 | Using Convolutional Neural Networks to Classify Hate Speech | Björn Gambäck, Utpal Kumar Sikdar | A Convolutional Neural Network (CNN) classifies tweets into four categories, with one CNN model trained per class. The feature set is mapped through the network by max-pooling, with a softmax function for classification.
2017 | Improving Hate Speech Detection with Deep Learning Ensembles | Steven Zimmerman, Chris Fox, Udo Kruschwitz | An ensemble model is created by averaging the softmax outputs of various models. It shows that the weight initialization method plays an important role in deep learning.
2017 | Deep Learning for Hate Speech Detection in Tweets | Pinkesh Badjatiya, Shashank Gupta, Vasudeva Varma | Multiple deep learning architectures are experimented with to learn semantic word embeddings. Combining random embeddings from a deep neural network model with gradient boosted decision trees gives the best result.
2017 | One Step and Two Step Classification for Abusive Language Detection on Twitter | Ji Ho Park, Pascale Fung | A two-step classification for abusive language detection classifies tweets into racism, sexism and clean. It uses a HybridCNN that takes both word and character features as input. The two-step approach gives better results than the one-step approach.
2016 | Abusive Language Detection in Online User Content | Chikashi Nobata, Joel Tetreault, Achint Thomas | A supervised classification method using four NLP feature types (n-gram, linguistic, syntactic and distributional semantics) is presented. Combining these features with standard NLP features improves the results.
2015 | Hate Speech Detection with Comment Embeddings | Nemanja Djuric, Jing Zhou, Robin Morris | paragraph2vec is used for joint modelling of comments and words, learned with a continuous BOW natural language model. The embeddings then train a binary classifier that distinguishes hateful from clean comments, addressing the issues of data scarcity and high dimensionality.
2015 | A Lexicon Based Approach for Hate Speech Detection | Njagi Dennis, Zhang Zuping, Jun Long | A classifier is modelled that uses sentiment analysis techniques for subjectivity detection. It rates the polarity of sentiment expressions and removes objective sentences; bootstrapping is then used for classification.
2004 | Classifying Racist Text using Support Vector Machine | Edel Greevy, Alan F. Smeaton | SVMs are used to automatically classify web pages, using a bigram representation of each web page within an SVM.
In the paper [21], sentiment, semantic, pattern and unigram features are used to classify tweets. It provides binary and ternary classification with 87.4% and 78.4% accuracy respectively; this result is the average over all four feature types, i.e. the sentiment, semantic, pattern and unigram features. In the work [3], Convolutional Neural Network (CNN) models, i.e. random vectors, word2vec and character n-grams, are used separately and the average of their results is calculated; the CNN models give 78.3% accuracy. Advancing on prior work, Björn Gambäck et al. [1] apply a combination of a convolutional neural network and long short-term memory for classification. They experiment on seven data sets: five classified into racism and sexism, one classified into Refugee and Muslim, and one classified generally (hate and non-hate). The results of the CNN and LSTM vary across these data sets. In the work [10], a two-step classification is used, one step to identify abusive language and another to classify hate speech; it uses the HybridCNN, which gives 95% accuracy. In the paper [19], a convolutional neural network is used along with a Gated Recurrent Unit (GRU), a variant of the recurrent neural network, i.e. a convolutional GRU, which works better on small data sets.
Gap in Existing Literature
The existing work has various gaps; much of it concentrates on psychological, behavioural and personal reasons. Various deep learning methods have been used for hate speech detection and give better results; we have to analyze all of these methods and algorithms, and ensemble the deep learning algorithms, to improve the detection of hate speech. Also, almost all hate speech detection focuses on the content posted on social media, while very few studies focus on users and their behaviour. The user is the starting point of hate speech, so user features deserve attention. This limitation is overcome by shifting the focus from tweets to hateful users. So in this research, after classification into hateful, offensive and clean, we analyze the features of the users who post hateful speech and predict hateful users.
Proposed System
The detection of offensive and hate speech on Twitter is a crucial functionality for an ensemble approach towards tackling harassment and misdemeanour. Quick and rigorous detection of hate speech can result in a timely reaction by users and uploaders, such as removal of the tweets or other responses on Twitter. However, an attempt can be made to go one step further and impede hate speech on Twitter. In this chapter, the general research framework focuses on the detection of such hate speech and on user features for prediction, with the help of classification algorithms. The proposed solution is based on this type of analysis of data generated by users on Twitter. The chapter covers the key challenges that are essential to explore and study in order to complete the proposed research work. Along with hate speech detection, prediction of users who persistently post hate speech is also carried out for better results.
Figure 3.1 presents the architecture of the proposed solution approach. The framework for the research goal of detecting hate speech uses a 16k annotated data set taken from Twitter and verified by experts. This research frames the problem of hate speech detection with three categories, i.e. Hateful, Offensive and Clean.
The proposed approach is a multi-step process consisting of four phases: training data collection, testing data collection, feature extraction, and classification of tweets using classification algorithms.
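The four phases above can be sketched as a minimal end-to-end skeleton. Every component here is a placeholder: the real system uses the annotated Twitter data set, embedding-based features and an LSTM classifier, whereas the keyword rule and tiny in-memory data set below are purely illustrative assumptions:

```python
# Minimal sketch of the four-phase pipeline with placeholder parts.
LABELS = ("Clean", "Offensive", "Hateful")

def collect(raw):
    """Phases 1-2: split raw (tweet, label) pairs into train/test."""
    cut = int(0.8 * len(raw))
    return raw[:cut], raw[cut:]

def extract_features(tweet):
    """Phase 3: here just lowercase tokens; the real system would
    produce word embeddings for the LSTM."""
    return tweet.lower().split()

def classify(tokens, lexicon):
    """Phase 4: a keyword rule standing in for the trained model."""
    hits = sum(t in lexicon for t in tokens)
    if hits == 0:
        return "Clean"
    return "Offensive" if hits == 1 else "Hateful"

lexicon = {"badword1", "badword2"}           # hypothetical lexicon
data = [("a normal tweet", "Clean"),
        ("badword1 here", "Offensive"),
        ("badword1 badword2 spam", "Hateful"),
        ("another clean tweet", "Clean"),
        ("you badword1 badword2", "Hateful")]
train, test = collect(data)
preds = [classify(extract_features(t), lexicon) for t, _ in test]
print(preds)
```

The value of laying the pipeline out this way is that each phase can be swapped independently, e.g. replacing the keyword rule with the trained LSTM without touching data collection or feature extraction.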