<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://awesomepapers.io/large-language-models/feed/publications.xml" rel="self" type="application/atom+xml" /><link href="https://awesomepapers.io/large-language-models/" rel="alternate" type="text/html" /><updated>2026-02-26T07:19:39-06:00</updated><id>https://awesomepapers.io/large-language-models/feed/publications.xml</id><title type="html">Awesome LLM Papers | Publications</title><subtitle>A continuously updated collection of research papers on LLMs. Maintained by &lt;a href=&quot;https://sjmoran.github.io&quot;&gt;Sean Moran&lt;/a&gt;.</subtitle><author><name>Sean Moran</name><email></email></author><entry><title type="html">Clicr: A Dataset Of Clinical Case Reports For Machine Reading Comprehension</title><link href="https://awesomepapers.io/large-language-models/publications/%C5%A1uster2018clicr/" rel="alternate" type="text/html" title="Clicr: A Dataset Of Clinical Case Reports For Machine Reading Comprehension" /><published>2026-02-26T07:19:39-06:00</published><updated>2026-02-26T07:19:39-06:00</updated><id>https://awesomepapers.io/large-language-models/publications/%C5%A1uster2018clicr</id><content type="html" xml:base="https://awesomepapers.io/large-language-models/publications/%C5%A1uster2018clicr/"><![CDATA[<p>We present a new dataset for machine comprehension in the medical domain. Our
dataset uses clinical case reports with around 100,000 gap-filling queries
about these cases. We apply several baselines and state-of-the-art neural
readers to the dataset, and observe a considerable gap in performance (20% F1)
between the best human and machine readers. We analyze the skills required for
successful answering and show how reader performance varies depending on the
applicable skills. We find that inferences using domain knowledge and object
tracking are the most frequently required skills, and that recognizing omitted
information and spatio-temporal reasoning are the most difficult for the
machines.</p>]]></content><author><name>Sean Moran</name></author><category term="Uncategorized" /><summary type="html"><![CDATA[We present a new dataset for machine comprehension in the medical domain. Our dataset uses clinical case reports with around 100,000 gap-filling queries about these cases. We apply several baselines and state-of-the-art neural readers to the dataset, and observe a considerable gap in performance (20% F1) between the best human and machine readers. We analyze the skills required for successful answering and show how reader performance varies depending on the applicable skills. We find that inferences using domain knowledge and object tracking are the most frequently required skills, and that recognizing omitted information and spatio-temporal reasoning are the most difficult for the machines.]]></summary></entry><entry><title type="html">Fly-swat Or Cannon? Cost-effective Language Model Choice Via Meta-modeling</title><link href="https://awesomepapers.io/large-language-models/publications/%C5%A1akota2023fly/" rel="alternate" type="text/html" title="Fly-swat Or Cannon? Cost-effective Language Model Choice Via Meta-modeling" /><published>2026-02-26T07:19:39-06:00</published><updated>2026-02-26T07:19:39-06:00</updated><id>https://awesomepapers.io/large-language-models/publications/%C5%A1akota2023fly</id><content type="html" xml:base="https://awesomepapers.io/large-language-models/publications/%C5%A1akota2023fly/"><![CDATA[<p>Generative language models (LMs) have become omnipresent across data science.
For a wide variety of tasks, inputs can be phrased as natural language prompts
for an LM, from whose output the solution can then be extracted. LM performance
has consistently been increasing with model size - but so has the monetary cost
of querying the ever larger models. Importantly, however, not all inputs are
equally hard: some require larger LMs for obtaining a satisfactory solution,
whereas for others smaller LMs suffice. Based on this fact, we design a
framework for cost-effective language model choice, called “Fly-swat or cannon”
(FORC). Given a set of inputs and a set of candidate LMs, FORC judiciously
assigns each input to an LM predicted to do well on the input according to a
so-called meta-model, aiming to achieve high overall performance at low cost.
The cost-performance tradeoff can be flexibly tuned by the user. Options
include, among others, maximizing total expected performance (or the number of
processed inputs) while staying within a given cost budget, or minimizing total
cost while processing all inputs. We evaluate FORC on 14 datasets covering five
natural language tasks, using four candidate LMs of vastly different size and
cost. With FORC, we match the performance of the largest available LM while
achieving a cost reduction of 63%. Via our publicly available library,
researchers as well as practitioners can thus save large amounts of money
without sacrificing performance.</p>]]></content><author><name>Sean Moran</name></author><category term="Uncategorized" /><summary type="html"><![CDATA[Generative language models (LMs) have become omnipresent across data science. For a wide variety of tasks, inputs can be phrased as natural language prompts for an LM, from whose output the solution can then be extracted. LM performance has consistently been increasing with model size - but so has the monetary cost of querying the ever larger models. Importantly, however, not all inputs are equally hard: some require larger LMs for obtaining a satisfactory solution, whereas for others smaller LMs suffice. Based on this fact, we design a framework for cost-effective language model choice, called “Fly-swat or cannon” (FORC). Given a set of inputs and a set of candidate LMs, FORC judiciously assigns each input to an LM predicted to do well on the input according to a so-called meta-model, aiming to achieve high overall performance at low cost. The cost-performance tradeoff can be flexibly tuned by the user. Options include, among others, maximizing total expected performance (or the number of processed inputs) while staying within a given cost budget, or minimizing total cost while processing all inputs. We evaluate FORC on 14 datasets covering five natural language tasks, using four candidate LMs of vastly different size and cost. With FORC, we match the performance of the largest available LM while achieving a cost reduction of 63%. 
Via our publicly available library, researchers as well as practitioners can thus save large amounts of money without sacrificing performance.]]></summary></entry><entry><title type="html">Inference-time Hyper-scaling With KV Cache Compression</title><link href="https://awesomepapers.io/large-language-models/publications/%C5%82a%C5%84cucki2025inference/" rel="alternate" type="text/html" title="Inference-time Hyper-scaling With KV Cache Compression" /><published>2026-02-26T07:19:39-06:00</published><updated>2026-02-26T07:19:39-06:00</updated><id>https://awesomepapers.io/large-language-models/publications/%C5%82a%C5%84cucki2025inference</id><content type="html" xml:base="https://awesomepapers.io/large-language-models/publications/%C5%82a%C5%84cucki2025inference/"><![CDATA[<p>Inference-time scaling trades efficiency for increased reasoning accuracy by generating longer or more parallel sequences. However, in Transformer LLMs, generation cost is bottlenecked by the size of the key-value (KV) cache, rather than the number of generated tokens. Hence, we explore inference-time hyper-scaling: by compressing the KV cache, we can generate more tokens within the same compute budget and further improve the accuracy of scaled inference. The success of this approach, however, hinges on the ability of compression methods to preserve accuracy even at high compression ratios. To make hyper-scaling practical, we introduce Dynamic Memory Sparsification (DMS), a novel method for sparsifying KV caches that only requires 1K training steps to achieve 8x compression, while maintaining better accuracy than training-free sparse attention. Instead of prematurely discarding cached tokens, DMS delays token eviction, implicitly merging representations and preserving critical information. We demonstrate the effectiveness of inference-time hyper-scaling with DMS on multiple families of LLMs, showing that it boosts accuracy for comparable inference runtime and memory load. 
For instance, we enhance Qwen-R1 32B by an average of 9.1 points on AIME 24, 7.6 on GPQA, and 9.6 on LiveCodeBench across compute budgets.</p>]]></content><author><name>Sean Moran</name></author><category term="Efficiency" /><summary type="html"><![CDATA[Inference-time scaling trades efficiency for increased reasoning accuracy by generating longer or more parallel sequences. However, in Transformer LLMs, generation cost is bottlenecked by the size of the key-value (KV) cache, rather than the number of generated tokens. Hence, we explore inference-time hyper-scaling: by compressing the KV cache, we can generate more tokens within the same compute budget and further improve the accuracy of scaled inference. The success of this approach, however, hinges on the ability of compression methods to preserve accuracy even at high compression ratios. To make hyper-scaling practical, we introduce Dynamic Memory Sparsification (DMS), a novel method for sparsifying KV caches that only requires 1K training steps to achieve 8x compression, while maintaining better accuracy than training-free sparse attention. Instead of prematurely discarding cached tokens, DMS delays token eviction, implicitly merging representations and preserving critical information. We demonstrate the effectiveness of inference-time hyper-scaling with DMS on multiple families of LLMs, showing that it boosts accuracy for comparable inference runtime and memory load. 
For instance, we enhance Qwen-R1 32B by an average of 9.1 points on AIME 24, 7.6 on GPQA, and 9.6 on LiveCodeBench across compute budgets.]]></summary></entry><entry><title type="html">Fastpitch: Parallel Text-to-speech With Pitch Prediction</title><link href="https://awesomepapers.io/large-language-models/publications/%C5%82a%C5%84cucki2020fastpitch/" rel="alternate" type="text/html" title="Fastpitch: Parallel Text-to-speech With Pitch Prediction" /><published>2026-02-26T07:19:39-06:00</published><updated>2026-02-26T07:19:39-06:00</updated><id>https://awesomepapers.io/large-language-models/publications/%C5%82a%C5%84cucki2020fastpitch</id><content type="html" xml:base="https://awesomepapers.io/large-language-models/publications/%C5%82a%C5%84cucki2020fastpitch/"><![CDATA[<p>We present FastPitch, a fully-parallel text-to-speech model based on
FastSpeech, conditioned on fundamental frequency contours. The model predicts
pitch contours during inference. By altering these predictions, the generated
speech can be more expressive, better match the semantics of the utterance,
and, in the end, be more engaging to the listener. Uniformly increasing or decreasing
pitch with FastPitch generates speech that resembles the voluntary modulation
of voice. Conditioning on frequency contours improves the overall quality of
synthesized speech, making it comparable to the state of the art. It does not
introduce any overhead, and FastPitch retains the favorable, fully-parallel
Transformer architecture, with over 900x real-time factor for mel-spectrogram
synthesis of a typical utterance.</p>]]></content><author><name>Sean Moran</name></author><category term="Model Architecture" /><summary type="html"><![CDATA[We present FastPitch, a fully-parallel text-to-speech model based on FastSpeech, conditioned on fundamental frequency contours. The model predicts pitch contours during inference. By altering these predictions, the generated speech can be more expressive, better match the semantics of the utterance, and, in the end, be more engaging to the listener. Uniformly increasing or decreasing pitch with FastPitch generates speech that resembles the voluntary modulation of voice. Conditioning on frequency contours improves the overall quality of synthesized speech, making it comparable to the state of the art. It does not introduce any overhead, and FastPitch retains the favorable, fully-parallel Transformer architecture, with over 900x real-time factor for mel-spectrogram synthesis of a typical utterance.]]></summary></entry><entry><title type="html">BASE TTS: Lessons From Building A Billion-parameter Text-to-speech Model On 100K Hours Of Data</title><link href="https://awesomepapers.io/large-language-models/publications/%C5%82ajszczak2024base/" rel="alternate" type="text/html" title="BASE TTS: Lessons From Building A Billion-parameter Text-to-speech Model On 100K Hours Of Data" /><published>2026-02-26T07:19:39-06:00</published><updated>2026-02-26T07:19:39-06:00</updated><id>https://awesomepapers.io/large-language-models/publications/%C5%82ajszczak2024base</id><content type="html" xml:base="https://awesomepapers.io/large-language-models/publications/%C5%82ajszczak2024base/"><![CDATA[<p>We introduce a text-to-speech (TTS) model called BASE TTS, which stands for Big Adaptive Streamable TTS with Emergent abilities. BASE TTS is the largest TTS model to date, trained on 100K hours of public domain speech data, achieving a new state of the art in speech naturalness. 
It deploys a 1-billion-parameter autoregressive Transformer that converts raw texts into discrete codes (“speechcodes”) followed by a convolution-based decoder which converts these speechcodes into waveforms in an incremental, streamable manner. Further, our speechcodes are built using a novel speech tokenization technique that features speaker ID disentanglement and compression with byte-pair encoding. Echoing the widely-reported “emergent abilities” of large language models when trained on increasing volumes of data, we show that BASE TTS variants built with 10K+ hours and 500M+ parameters begin to demonstrate natural prosody on textually complex sentences. We design and share a specialized dataset to measure these emergent abilities for text-to-speech. We showcase state-of-the-art naturalness of BASE TTS by evaluating against baselines that include publicly available large-scale text-to-speech systems: YourTTS, Bark and TortoiseTTS. Audio samples generated by the model can be heard at https://amazon-ltts-paper.com/.</p>]]></content><author><name>Sean Moran</name></author><category term="Evaluation" /><summary type="html"><![CDATA[We introduce a text-to-speech (TTS) model called BASE TTS, which stands for Big Adaptive Streamable TTS with Emergent abilities. BASE TTS is the largest TTS model to date, trained on 100K hours of public domain speech data, achieving a new state of the art in speech naturalness. It deploys a 1-billion-parameter autoregressive Transformer that converts raw texts into discrete codes (“speechcodes”) followed by a convolution-based decoder which converts these speechcodes into waveforms in an incremental, streamable manner. Further, our speechcodes are built using a novel speech tokenization technique that features speaker ID disentanglement and compression with byte-pair encoding. 
Echoing the widely-reported “emergent abilities” of large language models when trained on increasing volumes of data, we show that BASE TTS variants built with 10K+ hours and 500M+ parameters begin to demonstrate natural prosody on textually complex sentences. We design and share a specialized dataset to measure these emergent abilities for text-to-speech. We showcase state-of-the-art naturalness of BASE TTS by evaluating against baselines that include publicly available large-scale text-to-speech systems: YourTTS, Bark and TortoiseTTS. Audio samples generated by the model can be heard at https://amazon-ltts-paper.com/.]]></summary></entry><entry><title type="html">Explainability For Transparent Conversational Information-seeking</title><link href="https://awesomepapers.io/large-language-models/publications/%C5%82ajewska2024explainability/" rel="alternate" type="text/html" title="Explainability For Transparent Conversational Information-seeking" /><published>2026-02-26T07:19:39-06:00</published><updated>2026-02-26T07:19:39-06:00</updated><id>https://awesomepapers.io/large-language-models/publications/%C5%82ajewska2024explainability</id><content type="html" xml:base="https://awesomepapers.io/large-language-models/publications/%C5%82ajewska2024explainability/"><![CDATA[<p>The increasing reliance on digital information necessitates advancements in
conversational search systems, particularly in terms of information
transparency. While prior research in conversational information-seeking has
concentrated on improving retrieval techniques, the challenge of generating
responses that are useful from a user perspective remains. This study explores
different methods of explaining the responses, hypothesizing that transparency
about the source of the information, system confidence, and limitations can
enhance users’ ability to objectively assess the response. By exploring
transparency across explanation type, quality, and presentation mode, this
research aims to bridge the gap between system-generated responses and
responses verifiable by the user. We design a user study to answer questions
concerning the impact of (1) the quality of explanations enhancing the response
on its usefulness and (2) ways of presenting explanations to users. The
analysis of the collected data reveals lower user ratings for noisy
explanations, although these scores seem insensitive to the quality of the
response. Inconclusive results on the explanation presentation format suggest
that it may not be a critical factor in this setting.</p>]]></content><author><name>Sean Moran</name></author><category term="Uncategorized" /><summary type="html"><![CDATA[The increasing reliance on digital information necessitates advancements in conversational search systems, particularly in terms of information transparency. While prior research in conversational information-seeking has concentrated on improving retrieval techniques, the challenge of generating responses that are useful from a user perspective remains. This study explores different methods of explaining the responses, hypothesizing that transparency about the source of the information, system confidence, and limitations can enhance users’ ability to objectively assess the response. By exploring transparency across explanation type, quality, and presentation mode, this research aims to bridge the gap between system-generated responses and responses verifiable by the user. We design a user study to answer questions concerning the impact of (1) the quality of explanations enhancing the response on its usefulness and (2) ways of presenting explanations to users. The analysis of the collected data reveals lower user ratings for noisy explanations, although these scores seem insensitive to the quality of the response. 
Inconclusive results on the explanation presentation format suggest that it may not be a critical factor in this setting.]]></summary></entry><entry><title type="html">Aya Model: An Instruction Finetuned Open-access Multilingual Language Model</title><link href="https://awesomepapers.io/large-language-models/publications/%C3%BCst%C3%BCn2024aya/" rel="alternate" type="text/html" title="Aya Model: An Instruction Finetuned Open-access Multilingual Language Model" /><published>2026-02-26T07:19:39-06:00</published><updated>2026-02-26T07:19:39-06:00</updated><id>https://awesomepapers.io/large-language-models/publications/%C3%BCst%C3%BCn2024aya</id><content type="html" xml:base="https://awesomepapers.io/large-language-models/publications/%C3%BCst%C3%BCn2024aya/"><![CDATA[<p>Recent breakthroughs in large language models (LLMs) have centered around a handful of data-rich languages. What does it take to broaden access to breakthroughs beyond first-class citizen languages? Our work introduces Aya, a massively multilingual generative language model that follows instructions in 101 languages, of which over 50% are considered lower-resourced. Aya outperforms mT0 and BLOOMZ on the majority of tasks while covering double the number of languages. We introduce extensive new evaluation suites that broaden the state of the art for multilingual evaluation across 99 languages – including discriminative and generative tasks, human evaluation, and simulated win rates that cover both held-out tasks and in-distribution performance. Furthermore, we conduct detailed investigations on the optimal finetuning mixture composition, data pruning, as well as the toxicity, bias, and safety of our models. 
We open-source our instruction datasets and our model at https://hf.co/CohereForAI/aya-101</p>]]></content><author><name>Sean Moran</name></author><category term="Evaluation" /><category term="Fine Tuning" /><summary type="html"><![CDATA[Recent breakthroughs in large language models (LLMs) have centered around a handful of data-rich languages. What does it take to broaden access to breakthroughs beyond first-class citizen languages? Our work introduces Aya, a massively multilingual generative language model that follows instructions in 101 languages, of which over 50% are considered lower-resourced. Aya outperforms mT0 and BLOOMZ on the majority of tasks while covering double the number of languages. We introduce extensive new evaluation suites that broaden the state of the art for multilingual evaluation across 99 languages – including discriminative and generative tasks, human evaluation, and simulated win rates that cover both held-out tasks and in-distribution performance. Furthermore, we conduct detailed investigations on the optimal finetuning mixture composition, data pruning, as well as the toxicity, bias, and safety of our models. We open-source our instruction datasets and our model at https://hf.co/CohereForAI/aya-101]]></summary></entry><entry><title type="html">Udapter: Language Adaptation For Truly Universal Dependency Parsing</title><link href="https://awesomepapers.io/large-language-models/publications/%C3%BCst%C3%BCn2020udapter/" rel="alternate" type="text/html" title="Udapter: Language Adaptation For Truly Universal Dependency Parsing" /><published>2026-02-26T07:19:39-06:00</published><updated>2026-02-26T07:19:39-06:00</updated><id>https://awesomepapers.io/large-language-models/publications/%C3%BCst%C3%BCn2020udapter</id><content type="html" xml:base="https://awesomepapers.io/large-language-models/publications/%C3%BCst%C3%BCn2020udapter/"><![CDATA[<p>Recent advances in multilingual dependency parsing have brought the idea of a
truly universal parser closer to reality. However, cross-language interference
and restrained model capacity remain major obstacles. To address this, we
propose a novel multilingual task adaptation approach based on contextual
parameter generation and adapter modules. This approach enables learning
adapters via language embeddings while sharing model parameters across
languages. It also allows for an easy but effective integration of existing
linguistic typology features into the parsing network. The resulting parser,
UDapter, outperforms strong monolingual and multilingual baselines on the
majority of both high-resource and low-resource (zero-shot) languages, showing
the success of the proposed adaptation approach. Our in-depth analyses show
that soft parameter sharing via typological features is key to this success.</p>]]></content><author><name>Sean Moran</name></author><category term="Uncategorized" /><summary type="html"><![CDATA[Recent advances in multilingual dependency parsing have brought the idea of a truly universal parser closer to reality. However, cross-language interference and restrained model capacity remain major obstacles. To address this, we propose a novel multilingual task adaptation approach based on contextual parameter generation and adapter modules. This approach enables learning adapters via language embeddings while sharing model parameters across languages. It also allows for an easy but effective integration of existing linguistic typology features into the parsing network. The resulting parser, UDapter, outperforms strong monolingual and multilingual baselines on the majority of both high-resource and low-resource (zero-shot) languages, showing the success of the proposed adaptation approach. Our in-depth analyses show that soft parameter sharing via typological features is key to this success.]]></summary></entry><entry><title type="html">Building Foundations For Natural Language Processing Of Historical Turkish: Resources And Models</title><link href="https://awesomepapers.io/large-language-models/publications/%C3%B6zate%C5%9F2025building/" rel="alternate" type="text/html" title="Building Foundations For Natural Language Processing Of Historical Turkish: Resources And Models" /><published>2026-02-26T07:19:39-06:00</published><updated>2026-02-26T07:19:39-06:00</updated><id>https://awesomepapers.io/large-language-models/publications/%C3%B6zate%C5%9F2025building</id><content type="html" xml:base="https://awesomepapers.io/large-language-models/publications/%C3%B6zate%C5%9F2025building/"><![CDATA[<p>This paper introduces foundational resources and models for natural language processing (NLP) of historical Turkish, a domain that has remained underexplored in computational 
linguistics. We present the first named entity recognition (NER) dataset, HisTR, and the first Universal Dependencies treebank, OTA-BOUN, for a historical form of the Turkish language, along with transformer-based models trained using these datasets for named entity recognition, dependency parsing, and part-of-speech tagging tasks. Additionally, we introduce the Ottoman Text Corpus (OTC), a clean corpus of transliterated historical Turkish texts that spans a wide range of historical periods. Our experimental results show significant improvements in the computational analysis of historical Turkish, achieving promising results in tasks that require understanding of historical linguistic structures. They also highlight existing challenges, such as domain adaptation and language variations across time periods. All of the presented resources and models are made available at https://huggingface.co/bucolin to serve as a benchmark for future progress in historical Turkish NLP.</p>]]></content><author><name>Sean Moran</name></author><category term="Uncategorized" /><summary type="html"><![CDATA[This paper introduces foundational resources and models for natural language processing (NLP) of historical Turkish, a domain that has remained underexplored in computational linguistics. We present the first named entity recognition (NER) dataset, HisTR, and the first Universal Dependencies treebank, OTA-BOUN, for a historical form of the Turkish language, along with transformer-based models trained using these datasets for named entity recognition, dependency parsing, and part-of-speech tagging tasks. Additionally, we introduce the Ottoman Text Corpus (OTC), a clean corpus of transliterated historical Turkish texts that spans a wide range of historical periods. Our experimental results show significant improvements in the computational analysis of historical Turkish, achieving promising results in tasks that require understanding of historical linguistic structures. 
They also highlight existing challenges, such as domain adaptation and language variations across time periods. All of the presented resources and models are made available at https://huggingface.co/bucolin to serve as a benchmark for future progress in historical Turkish NLP.]]></summary></entry><entry><title type="html">Neural Machine Translation For Low-resource Languages</title><link href="https://awesomepapers.io/large-language-models/publications/%C3%B6stling2017neural/" rel="alternate" type="text/html" title="Neural Machine Translation For Low-resource Languages" /><published>2026-02-26T07:19:39-06:00</published><updated>2026-02-26T07:19:39-06:00</updated><id>https://awesomepapers.io/large-language-models/publications/%C3%B6stling2017neural</id><content type="html" xml:base="https://awesomepapers.io/large-language-models/publications/%C3%B6stling2017neural/"><![CDATA[<p>Neural machine translation (NMT) approaches have improved the state of the
art in many machine translation settings over the last couple of years, but
they require large amounts of training data to produce sensible output. We
demonstrate that NMT can be used for low-resource languages as well, by
introducing more local dependencies and using word alignments to learn sentence
reordering during translation. In addition to our novel model, we also present
an empirical evaluation of low-resource phrase-based statistical machine
translation (SMT) and NMT to investigate the lower limits of the respective
technologies. We find that while SMT remains the best option for low-resource
settings, our method can produce acceptable translations with only 70000 tokens
of training data, a level where the baseline NMT system fails completely.</p>]]></content><author><name>Sean Moran</name></author><category term="Evaluation" /><category term="Safety &amp; Alignment" /><category term="Survey Paper" /><summary type="html"><![CDATA[Neural machine translation (NMT) approaches have improved the state of the art in many machine translation settings over the last couple of years, but they require large amounts of training data to produce sensible output. We demonstrate that NMT can be used for low-resource languages as well, by introducing more local dependencies and using word alignments to learn sentence reordering during translation. In addition to our novel model, we also present an empirical evaluation of low-resource phrase-based statistical machine translation (SMT) and NMT to investigate the lower limits of the respective technologies. We find that while SMT remains the best option for low-resource settings, our method can produce acceptable translations with only 70000 tokens of training data, a level where the baseline NMT system fails completely.]]></summary></entry></feed>