This study seeks to verify the performance of the KICE Essay Scorer (KES) 2.0 and to improve the scoring reliability of large-scale essay writing tests. To this end, two items from the Level 2 writing test of the National English Ability Test were used as the evaluation data. These data were analysed to investigate inter-rater reliability using correlations, adjacent agreement rates, and exact agreement rates. Rater severity estimated with the multi-facet Rasch model, test score reliability estimated with generalizability theory (G-theory), and scoring time were investigated as well. The results revealed that KES 2.0 showed no significant differences from human scoring in inter-rater reliability or test score reliability. At the same time, KES 2.0 tended to show rater severity effects that were far smaller than those of human scoring. Moreover, KES 2.0 was much more efficient than human scoring in terms of scoring time. Compared with KES 1.0, KES 2.0 showed higher inter-rater reliability, while its test score reliability and rater severity were similar to those of KES 1.0. These findings imply that KES 2.0 performed better than KES 1.0 in some respects and is comparable to human scoring.
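For concreteness, the agreement indices named above (exact agreement, adjacent agreement, and correlation) can be computed as in the following minimal sketch. This is an illustrative example only, not the study's analysis code; the score vectors and the assumed 0-5 rating scale are hypothetical.

```python
from typing import Sequence

def exact_agreement(a: Sequence[int], b: Sequence[int]) -> float:
    """Proportion of essays on which both raters assign the identical score."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def adjacent_agreement(a: Sequence[int], b: Sequence[int], tol: int = 1) -> float:
    """Proportion of essays on which the two scores differ by at most `tol` points."""
    return sum(abs(x - y) <= tol for x, y in zip(a, b)) / len(a)

def pearson_r(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson product-moment correlation between two score vectors."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

# Hypothetical scores from a human rater and an automated scorer on eight essays.
human = [3, 4, 2, 5, 3, 4, 1, 3]
auto  = [3, 4, 3, 5, 2, 4, 1, 3]

print(f"exact agreement:    {exact_agreement(human, auto):.2f}")
print(f"adjacent agreement: {adjacent_agreement(human, auto):.2f}")
print(f"correlation:        {pearson_r(human, auto):.2f}")
```

Exact agreement is the strictest of the three indices, adjacent agreement credits near-misses within one scale point, and the correlation captures how consistently the two raters rank the essays even when their absolute scores differ.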