Language, terminology and the readability of online cancer information
Pam Peters (1), Adam Smith (1), Yasmin Funk (1), John Boyages (2)

  1. Department of Linguistics, Macquarie University, North Ryde, New South Wales, Australia
  2. Macquarie University Cancer Institute, Macquarie University, North Ryde, New South Wales, Australia

Correspondence to Emeritus Professor Pam Peters, Department of Linguistics, Macquarie University, North Ryde, NSW 2109, Australia; pam.peters{at}


Medical terms are a recognised problem in doctor–patient consultations. By contrast, the language difficulties of online healthcare documents are underestimated, even though patients are often encouraged to go to the internet for information. Literacy levels in the community vary, and for patients, carers and health workers with limited reading skills (including first- and second-language users of English), the language of web-based health documents may be challenging or impenetrable. Online delivery of health information is inherently problematic because it cannot provide two-way discussion; and amid the range of health documents on the web, the intended readership (whether general or specialist) is rarely indicated up front. In this research study, we focus on the language and readability of web-based cancer documents, using lexicostatistical methods to profile the vocabularies in two large test databases of breast cancer information, one consisting of material designed for health professionals, the other for the general public. They yielded significantly different word frequency rankings and keyness values, broadly correlating with their different readerships, that is, scientifically literate readers for the professional dataset, and non-specialist readers for the public dataset. The higher type/token ratio in the professional dataset confirms its greater lexical demands, with no concessions to the variable language and literacy skills among second-language health workers. Their language needs can, however, be addressed by a new online multilingual termbank of breast cancer vocabulary, HealthTermFinder, designed to sit alongside health documents on the internet, and provide postconsultation help for patients and carers at their point of need.

  • Cancer care
  • Linguistics


Problems caused by medical terminology

Medical terminology is one of the ‘linguistic jungles’ of science,1 a habitat swarming with myriad terms from medical research and practice, allied health sciences, pharmacology and innumerable therapeutic products. The terminological mix is a challenge for health professionals, especially those for whom English is a second language. Australia, like other developed countries, has been actively recruiting doctors, dentists and nurses from non-English-speaking countries for the last 30 years.2 Their proficiency in English varies according to their source countries, generally lower for those from Egypt than for those from South Africa.2

Medical terms are often a problem for patients and carers because of low levels of community literacy. Just under half the Australian population (46%) falls short of the minimum prose literacy skills, defined as the ability to ‘compare and contrast written information’ (or) ‘extract information from a pamphlet’ (Australian Bureau of Statistics 2006).3 That minimum is the current benchmark for ‘coping with the increasing demands of the emerging knowledge society and information economy’, but its relationship to health literacy is unclear.4 The terminology used in medical consultations may be baffling—if not intimidating— especially to patients whose first language is not English, contributing to insecurity and anxiety.5 The affective force of medical language was demonstrated experimentally by Omer et al6 in a British study of the alternative terminology used by health professionals when explaining ductal carcinoma in situ (DCIS) to patients. They found that most women switched their treatment preferences from surgery to active surveillance when DCIS was discussed in terms of ‘abnormal cells’ rather than ‘non-invasive cancer’.

Medical terminology and diagnostic categories, such as those associated with fine-needle aspiration (FNA) cytology for breast cancer, can be problematic for specialists themselves. In Howell's research, Californian data from 822 applications of the four standard diagnostic categories used for breast cancer (benign, malignant, suspicious/probably malignant, atypical/indeterminate) were re-examined to determine the usefulness of separating the third and fourth categories, because both are equivocal and usually require surgical follow-up, whichever label is applied.7 Howell recommended combining the third and fourth categories into one, as ‘suspicious/equivocal’ for better communication through the assessment process. This suggests that ‘delicacy’ in any diagnostic system and its terminology need to be set against logistical and communication issues for all concerned. Such terms belong to the jargon of medicine—terminology that excludes all but health professionals—and need to be explained to patients,8 especially those for whom English is a second language.

Semitechnical terms and abbreviations

An essential problem with English medical terms is that many, especially those used in assessment and treatment, come from the common language of science and are inherently polysemous, that is, they have several senses attached to them.9 These ‘semitechnical terms’10 used across science disciplines are known to pose difficulties for professionals who are not native speakers of English and whose scientific education was obtained in another language.11 The term screening can carry a specific or an abstract sense, as in ‘arrange for a screening tomorrow’ or ‘consider alternative screening methods for cancer’, apart from its other uses in healthcare: to mean ‘testing’ for a disease (eg, hepatitis), or ‘protection’ (shielding skin from harmful ultraviolet rays by using sunburn lotions), not to mention its everyday use to mean the ‘selection’ (choosing) of candidates for a job. Just which sense applies in a given context may not be obvious, especially to second-language users of English. Slight variations in the forms of words may refer to very different kinds of treatment, as with hormone (replacement) therapy and hormonal therapy, which are normally used to distinguish interventions that supplement hormone production from those that block it. The significance of suffixes like –al on English adjectives (hormonal) may be overlooked by Chinese speakers because the Chinese language does not have an established pool of derivational suffixes for constructing words.12

Complex meanings are also embodied in English medical abbreviations such as FNA and DCIS. Although abbreviations are convenient shorthand for communication between specialists, many are specialty specific.13 Their meanings may, therefore, be obscure to other health professionals working on shifts in multidisciplinary hospital teams.14 This experimental British study found that hospital staff could decode only 58% of the abbreviations on patients' medical records. The barriers to understanding for patients are obvious.

Medical communication, face-to-face and via the internet

Language problems in medical communication have been noted in Australian government advisory documents to doctors, including the National Health and Medical Research Council's brochure on communicating with patients.15 Linguistic research has increasingly focused on medical communication, such as verbal exchanges in emergency settings16 and consultations between health professionals and patients.17 In medical consultations, doctors can repair any misunderstanding evident in their patients' responses, and patients do have the opportunity to ask questions, provided they have sufficient language and confidence to do so.

Health professionals may also refer patients to printed fact sheets or medical websites for fuller explanations. There, communication is strictly one way, and the providers of online health information are not always sensitive to the unseen reader. Their mode of communication can be opaque, as in the following example from a medical fact sheet intended for patients and carers:

Effective treatment and management will involve psychosocial support and interventions that optimise quality of life and facilitate self-management.

That sentence makes dense and difficult reading because of its concentration of abstract nouns.18 The style is impersonal rather than engaging for patients and carers, whether English is their first or second language. The semitechnical terms used, for example, treatment, intervention and self-management, can obscure communication just as much as the more obvious technical term, psychosocial. Both add to the difficulty for readers, apart from the fact that the documents returned by internet search engines vary in their technicality, with little upfront indication of their intended readership. Wikipedia articles are not necessarily easy reading, as found in research on the quality of online cancer information.19 Although the accuracy of information in the cancer articles was affirmed, their readability was consistently lower than that of articles from a database maintained by medical professionals. Yet, articles from Wikipedia ranked among the top 10 items returned by 71%–85% of the search engines tested. Ease of access to the internet does not guarantee the accessibility of medical information delivered through it, for those with less than minimum literacy skills in English. Online health forums may help to mediate medical terminology as a kind of ‘oblique popularisation’,20 but their reach, scope and impact are at this stage uncharted.

How good are online health communicators generally at adapting their expression to make information accessible to non-specialist readers? That question motivated our research into a large body of online health documents to obtain a statistical profile of the language used, to see how appropriate it was for community readers and to determine whether the language of material intended for health professionals was significantly different from that intended for the general public.

Online health communication on breast cancer

As a case study of the language of online health information, we examined a set of documents on breast cancer. They were obtained from one of Macquarie University library's specialised online open access LibGuides.21 Many of these documents are for the general public, but others identify themselves as aimed at health professionals. This mixed body of material invited closer inspection of the language and terminology in the two types of document, to see how they differentiated themselves in relation to their intended readers. The documents were extracted into two separate databases (public and professional), based on any explicit indications of each document's addressees and on its general tenor: either impersonal, with pervasive use of the third person (it, they), or personal, using first and second person as well as third (I/we, you, it, they). They were also distinguished by their respective text types and genres of publication:

  1. Public, consisting of texts intended for the general reader: fact sheets, some originally distributed in print; others written for the website as general advice to explain cancer treatments and anticipate patients’ questions.

  2. Professional, consisting of documents intended for health professionals, including journal articles, reviews and summaries of clinical research, clinical best practice, diagnosis and screening techniques.

The two databases were unequal in size, the professional database totalling 567 718 words, the public one 717 517 words. Subsets of each were used for the individual statistical analyses.

The language data were analysed with the WordSmith Tools (V.6.0) to extract word frequencies and keyword lists, and to generate statistical summaries.22 Word lists were preprocessed using a stop list to exclude high-frequency function words such as and, of and the, so that the relative frequencies of significant content words and medical terminology could be calculated. WordSmith's keyword function was used to identify the words and terms whose presence is most salient for professional discourse and the extent to which it differs from the public database. Type/token ratios were calculated to analyse the lexical variation and dispersion within each database.
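The preprocessing described above can be illustrated with a minimal Python sketch. It is not the WordSmith procedure itself: the stop list, the tokenising rule and the function names are all assumptions made for illustration, and a real stop list would be far longer.

```python
import re
from collections import Counter

# Hypothetical stop list for illustration only; the study's actual list is not given.
STOP_WORDS = {"and", "of", "the", "a", "in", "to", "is", "for"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def content_word_frequencies(text, stop_words=STOP_WORDS):
    """Count content words only; return (counts, total number of content tokens)."""
    tokens = [t for t in tokenize(text) if t not in stop_words]
    return Counter(tokens), len(tokens)


def top_relative_frequencies(text, n=3):
    """Return the n most frequent content words with their percentage share."""
    counts, total = content_word_frequencies(text)
    return [(word, round(100 * count / total, 2))
            for word, count in counts.most_common(n)]


if __name__ == "__main__":
    sample = "The risk of breast cancer and the treatment of breast cancer"
    print(top_relative_frequencies(sample, n=2))
```

Because function words are removed before counting, the percentages are shares of the content-word total rather than of the full running text, which is why a stop-listed word count is smaller than the raw one.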

Statistical profiles of cancer healthcare vocabularies in the professional and public databases

Relative frequencies of words and terms in the two databases

For the primary frequency analysis, a two-thirds subset of the public database (531 317 words) was created by omitting materials that were not originally written for the website, so as to make the public database approximately the same size as the professional database (567 718 words). A stop list was applied to both datasets to exclude function words, as indicated by the somewhat smaller total of words in the header of table 1. Lexical inventories of the remaining (content) words were then extracted by WordSmith, yielding 17 680 different words in the professional database, and 11 779 in the public database. The content words in the two types of document were ranked by their relative frequency, with the top 24 for each shown in table 1. Unsurprisingly, the highest ranked words in both databases were breast, cancer and women, representing the core substance of this medical area.

Table 1

Comparative rankings of top 24 words and terms in the two databases

Below the top three in each list, two very highly ranked items in both databases are risk and treatment, plus a few others with different rankings lower down the two lists (information, health, care and Australia). All these reflect the shared focus of the two types of document, but otherwise, the two vocabularies diverge, showing markedly different emphases. The professional list is stocked with medical terminology, including technical terms such as biopsy, carcinoma and clinical, and semitechnical ones such as evidence, practice, guidelines and screening. The latter represent the health professional's perspective, but they also embody the interdisciplinary language of science, as noted above. The public list highlights words such as help, support, family, feel, need and pain, as well as generic medical terms such as chemotherapy and side+effects, expressing the patient's concerns. The two lists highlight complementary aspects of healthcare and incidentally confirm the ecological validity of the data analysis. They also reflect the very different lexical demands of the two databases, particularly those presented by medical and scientific terms for the general reader.

Keywords in the two databases

Apart from their relative frequency in a given text, certain terms are particularly salient for their text type by their presence or absence. Their appearances, therefore, carry disproportionate challenges for readers unfamiliar with medical discourse, although the words and terms themselves may not be the most frequent in medical texts. These keywords in the professional database are identified by their relatively high frequencies in it, combined with their low frequencies in a larger independent database of everyday, non-specialised texts. The keyness values of words as calculated by WordSmith broadly reflect their over- or under-representation in a ‘study corpus’, using the larger and more general ‘reference corpus’ as the yardstick. To satisfy the statistical requirements, we sectioned off half the data from the professional database (from the documents on clinical best practice, the largest category within it), so as to create a study corpus of 263 568 words (reduced to 238 782 with the exclusion of stop words); and for the reference corpus we used the full quota of 717 517 words from the public database (reduced to 756 434 without stop words). By this, we created a 1:3 ratio between the two datasets, as in other comparative language research.23
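To make the idea of keyness concrete, the following Python sketch computes a signed log-likelihood score for a single word. This is an assumption about the statistic: log-likelihood is one of the options WordSmith offers, but the text does not say which setting was used, and the two-term formula below is a commonly used simplified form rather than necessarily the exact calculation behind table 2. The function name and the word counts in the usage example are hypothetical.

```python
import math


def log_likelihood_keyness(freq_study, size_study, freq_ref, size_ref):
    """Signed log-likelihood keyness of one word.

    Positive values mean the word is over-represented in the study corpus
    relative to the reference corpus; negative values mean under-represented.
    """
    total_freq = freq_study + freq_ref
    total_size = size_study + size_ref
    # Frequencies expected if the word were spread evenly across both corpora.
    expected_study = size_study * total_freq / total_size
    expected_ref = size_ref * total_freq / total_size

    score = 0.0
    if freq_study > 0:
        score += freq_study * math.log(freq_study / expected_study)
    if freq_ref > 0:
        score += freq_ref * math.log(freq_ref / expected_ref)
    score *= 2

    # Attach a sign according to which corpus uses the word at the higher rate.
    if freq_study / size_study < freq_ref / size_ref:
        return -score
    return score


if __name__ == "__main__":
    # Corpus sizes are the stop-listed totals quoted in the text;
    # the word counts are invented for illustration.
    print(log_likelihood_keyness(400, 238782, 50, 756434))   # strongly positive
    print(log_likelihood_keyness(20, 238782, 900, 756434))   # strongly negative
```

With one degree of freedom, an absolute score above about 10.83 corresponds to p<0.001, so words at either extreme can be highly significant even when their raw frequencies are modest.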

Table 2 presents the extremes of positive and negative keyness for two sets of 12 words from the opposite ends of the range in the professional (study) corpus. Here, the words are ordered by their keyness values shown on the right-hand side of the table (from extremely positive in the top half of the table to extremely negative in the bottom half). Their raw frequencies in the smaller professional dataset are shown in the second column from the left and as a percentage of the total word count in the third. The raw frequencies of the same words in the reference (public) database are shown in the fourth column, and in the fifth column as a percentage of the overall wordage there, where it is 0.01 or greater. Though not shown, all keyness values are statistically significant (all p<0.001).

Table 2

Keyness values for terms and words at the extremes in the professional dataset

These two sets of keywords from opposite ends of the professional word list make interesting comparison in their significance for medical discourse. The high positive keyness of the medical term carcinoma contrasts with the low negative value for the general term cancer. Although cancer ranks equally high in both word lists in table 1 because of its overall frequency, its keyness value for professional medical discourse is very low, because it commonly appears in non-specialised documents. Also remarkable are the high keyness values of excluded and exclusion, indicating their regular use in evidence-based science and medical research, whereas their appearances in the public database are so few they fall below the percentage threshold to be reported. By contrast, the generic word researchers carries very low or negative keyness value for medical discourse, reflecting the importance of citing specific research studies. The high keyness value attached to patients (showing the professional concern with medical populations) contrasts with the low keyness values attached to words reflecting the support group for the individual patient, for whom family and the doctor are the key people in their situation. The individual perspective is also embedded in everyday words such as pain, feel and talk, among others that are negatively keyed. These complementary keyness values (positive and negative) reflect the focus of the professional dataset, emphasising the disciplinary aspects of scientific medicine that underwrite professional practice. Scientific literacy is assumed, as is appropriate when health professionals are addressing their peers. Yet, the individual concerns of patients are not often articulated in this online written material. This is not to say that doctors and health workers do not discuss patients' concerns in consultations with them. It does suggest that the online professional information which patients may encounter on the web would bypass their essential concerns.

Lexical variation and distribution

One other statistically based measure of the difference between the vocabularies of these two datasets is their respective type/token ratios. A type/token ratio is the ratio between the number of different words appearing in a text and the total count of words in it. It offers a measure of the variation of vocabulary within a text or discourse, indicating either concentrated use of a smaller everyday vocabulary (as in conversation), or a larger, more diverse vocabulary, as in informative writing.24 When the word tokens used are spread over a larger inventory of different word types, the type/token ratio is higher, with more types to tokens (ref. 25 p.52). Because type/token ratios vary with the volume of material,26 any comparisons need to be based on equivalent amounts of text. To do this, we used the smaller (two-thirds) version of the public corpus to match the whole of the professional corpus. As already noted, there was a marked contrast in the number of different content words or ‘types’ in the two datasets, with 17 680 types in the professional database, and 11 779 types in the public database, a clear indication of the much greater lexical demands of the professional dataset. This difference can also be expressed in WordSmith calculations of their overall type/token ratios: for the professional database it is 3.48, compared with 2.31 in the public database. The higher type/token ratio found for the professional dataset also reflects its larger range of technical and semitechnical terms and their dispersion through medical texts. It means that new, previously unseen types keep coming up to challenge readers, especially those with lower levels of literacy in English.
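The type/token calculation can be sketched in a few lines of Python. The sketch assumes the ratio is expressed as a percentage (types per 100 tokens), which is consistent with the figures quoted above but is not stated in the text; the function names and the two toy word lists are hypothetical.

```python
def type_token_ratio(tokens):
    """Return the type/token ratio as a percentage (types per 100 tokens)."""
    if not tokens:
        raise ValueError("cannot compute a ratio for an empty text")
    return 100 * len(set(tokens)) / len(tokens)


def ratios_at_equal_size(tokens_a, tokens_b):
    """Compare two texts on equal footing by truncating both to the shorter length.

    The ratio falls as a text grows, so unequal samples are not comparable.
    """
    size = min(len(tokens_a), len(tokens_b))
    return type_token_ratio(tokens_a[:size]), type_token_ratio(tokens_b[:size])


if __name__ == "__main__":
    # Toy samples: a concentrated everyday vocabulary versus a more diverse one.
    everyday = "pain pain help pain help pain feel pain".split()
    specialist = "biopsy carcinoma clinical screening evidence biopsy".split()
    print(ratios_at_equal_size(everyday, specialist))
```

Truncating to a common length is only one simple way to equalise sample sizes; the study instead matched the corpora by using the two-thirds subset of the public database.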

Medical and scientific terminology is beyond the ken of native English speakers without tertiary education and is likely to impact on patients' comprehension of their condition and treatment. Research on health literacy by Elder et al4 shows that patients who can pronounce medical terms correctly do not necessarily understand their meanings. Different understandings of polysemous words with medical applications can arise, as with practice, which may be taken by patients to refer to the doctor's surgery, rather than a recognised approach to treatment. The need to secure understanding of semitechnical terms in doctor–patient consultations is an under-recognised issue (ref 5, p.87). Such terminology also challenges health professionals whose scientific education was not in English.11 The risk of misunderstanding is compounded when both the health professional and the patient are not native English speakers and are from different language backgrounds.

Addressing the language problem

This study of the language used in online breast cancer information offers quantitative measures of its linguistic demands on readers. Their understanding may be limited by their level of literacy in English, as clients of the healthcare system, or as non-native-English-speaking health workers within it—trainees and early career practitioners in health sciences. All need help with decoding medical information, such as a supportive online terminology dictionary, preferably with translation facilities in their own language. An online dictionary of more than 7000 cancer terms is attached to the US National Cancer Institute's website, but it lacks the equally problematic semitechnical terms we have discussed. It provides extended glossary-style definitions and recorded pronunciations of each term in American English with American-English phonetic transliterations.

An alternative approach to explaining cancer terminology is HealthTermFinder, an online dictionary-style information tool (or termbank) in breast cancer terminology, under construction at Macquarie University. The termbank contains medical terms relating to breast cancer pathology, as well as semitechnical terms used in its diagnosis and treatment, and pharmaceutical terms, both generic and proprietary names current in Australia. The termbank is designed according to best practice in pedagogical lexicography,27 and multimodal language learning,28 using verbal as well as audiovisual elements as complementary ways of communicating meaning. Plain language definitions (drafted by the first two authors of this paper) are provided for each term along with example sentences drawn from the databases, to show how the term combines with other words, and to add factual information about the experience of breast cancer. Terms contained within the examples are linked to their own pages within the termbank. Audio recordings of the term and its definition are included to aid those with low reading skills in English, whether they are patients, carers or health workers. Enlargeable graphics help to visualise the medical concept. Photographs such as the one on the termbank page for lymphoedema add a humanising dimension to the verbal information,29 reducing its abstractness, and investing it with individual relevance. The photograph complements the text, unlike line drawings or comic strip representations of disease, which necessarily introduce their own semiotics by their graphic styles and embedded medical narratives.29,30 HealthTermFinder will offer translations of the head term, its definition and the captions or labels on graphics, into five of the major immigrant languages used in Australia. The essential English content of each termbank page is shown in figure 1.

Figure 1

Screenshot from the HealthTermFinder breast cancer termbank, showing features designed for second-language users of English and those with lower reading skills.

The primary content of HealthTermFinder (definitions, examples, images, tables) is all reviewed by medical consultants (including the fourth author of this paper), and the translations by two translators for each language. The termbank's usability and the readability of information on its pages will be tested by English-speaking and non-English-speaking clients at public health clinics, as well as non-native-English-speaking health professionals. Its availability online will ensure it supports them all in understanding cancer terminology wherever they are, and sits alongside whatever documents they are reading on health websites.




  • Contributors PP: conception and design of study; drafting of text and its revision. AS: substantial contribution to the statistical analyses and interpretation of lexical outputs. YF: substantial contribution to the acquisition and selection of material for the corpus, responsible for its structure. JB: reviewing paper for medical accuracy, and for overall intellectual content.

  • Competing interests None declared.

  • Provenance and peer review Not commissioned; externally peer reviewed.
