- Lt- 1end if compute tt+1

end fo r

1. how far the algorithm has progresse d (t) 2. the magnitude of the increase in empirical loss (Lt

) 3. the balance of the training distributions ( ß )

The motivation b ehind the third heuristic is that in the early stages of the algorithm, unbalanced training distributions lead to aggregate vectors that are skewed toward the dominant class. I f the pro ce dure is stopp e d to o early, the e mpiric al loss will b e dis prop ortionately high for the s ub dominant clas s, leading to a s kewe d weight vector. The e ﬀect of ß is to relax the timing schedule for imbalanced data w hich results in higher quality solutions.

The TAP optimisation pro cedure requires storage of the input vectors along with the feature weight and up date ve ctors, yielding space complexity of O (N ) in the numb er of training samples. At each iteration, computation of the empirical loss and aggregate vector is O (N · ν ) (recall that ν is the average numb er of unique features p e r sample). Given the curre nt and previous loss value s, computing t is O (1) and thus each ite ration scales with time complexity O (N ) in the numb e r of training samples. The numb e r of training iterations is governed by the rapidity of the timing schedule which has no direct dep e nde nce on the numb er of training s amples, yielding an approximate overall complexity of O (N ) (linear) in the numb er of training s amples .

The TAP and SVM mo dels describ ed ab ove p erform binary disc riminative clas siﬁcation, in which training exam scripts mus t b e divided into ‘pass’ and ‘fail’ categories . The conﬁdence margin ge nerated by the classiﬁe r on a given test sc ript can b e inte rpreted as an e stimate of the degree to which that script has passed or failed, e.g. a ‘go o d’ pass or a ‘bad’ fail. However, this gradation of script quality is not mo de lled explicitly by the classiﬁer, rather it relies on emerge nt correlation of key f eatures with script quality.

In this section, we intro duce an alte rnative ML technique c alled preferenc e ranking which is b e tter suited to the AAE T task. I t explicitly mo dels the relationships b etween sc ripts by learning an optimal ranking over a given sample domain, inferred through an optimisatio

pro cecure that utilises a sp ec iﬁed orde ring on training samples . This allows us to mo del the fact that some sc ripts are ‘b etter’ than others, across

an arbitrary grade range, without nece ssarily having to sp ecif y a numerical score for each, or intro duce an arbitrary pass /fail b oundary.

We now pres ent a version of the TAP algorithm that eﬃciently learns pref erence ranking mo dels. A de rivation of similar equations for learning SVM- base d mo dels , and pro of of their optimality is give n by Joachims (2002).

The TAP preference ranking optimis ation pro cedure requires a set of training samples, x1, x2, . . . , xn, and a ranking

where µ is the margin, given a sp ec iﬁc value b elow. The derived set of pairwise diﬀ ere nce vectors grows quickly

as a function of the numb er
of training samples. An upp er b ound on the numb er of diﬀerence vectors f or a se t of training vectors is give n by

where r is the numb er of ranks and a is the average rank frequency. This yields intrac table numb e rs of diﬀ erence vec tors for eve n mo dest numb ers of training

vectors, eg: r = 4, a = 2000 yields 24, 000, 000 diﬀerenc e vectors. To overcome this, the TAP optimisation pro cedure employs a sampling strategy to re duce

the numb er of diﬀerenc e vec tors to a manageable quantity. An upp e r b ound is sp e ciﬁed on the numb er of training vectors, and then the probability of s ampling an arbitrary diﬀerence ve ctor is given by u /u where u is the sp eciﬁe d upp er b ound and u is given ab ove.

The optimisation algorithm the n pro ceeds as for the classiﬁc ation mo del (Algorithm 1), except we have a one- sided margin. The mo diﬁed pro cedure is shown in Algorithm 2