Algorithm
In [[mathematics]], [[computing]], [[linguistics]] and related disciplines, an '''algorithm''' is a sequence of instructions, often used for [[calculation]] and [[data processing]]. It is formally a type of [[effective method]] in which a list of well-defined instructions for completing a task will, when given an initial state, proceed through a well-defined series of successive states, eventually terminating in an end-state. The transition from one state to the next is not necessarily [[deterministic]]; some algorithms, known as [[probabilistic algorithms]], incorporate randomness.
A partial formalization of the concept began with attempts to solve the [[Entscheidungsproblem]] (the "decision problem") posed by [[David Hilbert]] in 1928. Subsequent formalizations were framed as attempts to define "[[effective calculability]]" (Kleene 1943:274) or "effective method" (Rosser 1939:225); those formalizations included the Gödel-Herbrand-Kleene [[Recursion (computer science)|recursive function]]s of 1930, 1934 and 1935, [[Alonzo Church]]'s [[lambda calculus]] of 1936, [[Emil Post]]'s "Formulation I" of 1936, and [[Alan Turing]]'s [[Turing machines]] of 1936-7 and 1939.
==Etymology==
[[Muhammad ibn Mūsā al-Khwārizmī|Al-Khwārizmī]], [[Persian people|Persian]] [[astronomer]] and [[mathematician]], wrote a [[treatise]] in [[Arabic]] in 825 AD, ''On Calculation with Hindu Numerals'' (see [[algorism]]). It was translated into [[Latin]] in the 12th century as ''Algoritmi de numero Indorum'' (al-Daffa 1977), a title likely intended to mean "Algoritmi on the numbers of the Indians", where "Algoritmi" was the translator's rendition of the author's name. Readers who misunderstood the title treated ''Algoritmi'' as a Latin plural, and this led to the word "algorithm" (Latin ''algorismus'') coming to mean "calculation method". The intrusive "th" is most likely due to a [[false cognate]] with the [[Greek language|Greek]] {{lang|grc|ἀριθμός}} (''arithmos'') meaning "number".
== Why algorithms are necessary: an informal definition ==
No generally accepted ''formal'' definition of "algorithm" exists yet.
An informal definition could be "an algorithm is a computer program that calculates something." For some people, a program is only an algorithm if it stops eventually. For others, a program is only an algorithm if it stops before a given number of calculation steps.
A prototypical example of an "algorithm" is Euclid's algorithm for determining the greatest common divisor of two integers greater than one: "subtract the smaller number from the larger one, and repeat until you get a zero or a one". When a zero appears, the other number is the greatest common divisor; when a one appears, the greatest common divisor is one. This procedure is known always to stop, and the number of subtractions needed is always smaller than the larger of the two numbers.
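The subtraction procedure can be sketched in a few lines of Python. This is only an illustration: the function name is made up for this example, and the sketch uses the common variant that stops as soon as the two numbers are equal, at which point their shared value is the greatest common divisor.

```python
def gcd_by_subtraction(a: int, b: int) -> int:
    """Greatest common divisor of two positive integers by repeated subtraction."""
    if a < 1 or b < 1:
        raise ValueError("both inputs must be positive integers")
    # Each pass strictly reduces the larger number, so the loop must end.
    while a != b:
        if a > b:
            a -= b
        else:
            b -= a
    return a


print(gcd_by_subtraction(48, 18))
```

Because the larger of the two numbers shrinks on every pass and can never drop below one, the loop cannot run forever, which is the termination argument given above.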
We can derive clues to the issues involved and an informal meaning of the word from the following quotation from {{Harvtxt|Boolos|Jeffrey|1974, 1999}} (boldface added):
No human being can write fast enough, or long enough, or small enough to list all members of an enumerably infinite set by writing out their names, one after another, in some notation. But humans can do something equally useful, in the case of certain enumerably infinite sets: They can give '''explicit instructions for determining the nth member of the set''', for arbitrary finite n. Such instructions are to be given quite explicitly, in a form in which '''they could be followed by a computing machine''', or by a '''human who is capable of carrying out only very elementary operations on symbols'''
The words "enumerably infinite" mean "countable using integers perhaps extending to infinity". Thus Boolos and Jeffrey are saying that an algorithm ''implies'' instructions for a process that "creates" output integers from an ''arbitrary'' "input" integer or integers that, in theory, can be chosen from 0 to infinity. Thus we might expect an algorithm to be an algebraic equation such as '''y = m + n''' — two arbitrary "input variables" '''m''' and '''n''' that produce an output '''y'''. As we see in [[Algorithm characterizations]] — the word algorithm implies much more than this, something on the order of (for our addition example):
:Precise instructions (in language understood by "the computer") for a "fast, efficient, good" ''process'' that specifies the "moves" of "the computer" (machine or human, equipped with the necessary internally-contained information and capabilities) to find, decode, and then process arbitrary input integers/symbols '''m''' and '''n''', symbols '''+''' and '''=''' ... and (reliably, correctly, "effectively") produce, in a "reasonable" [[time]], output-integer '''y''' at a specified place and in a specified format.
The concept of ''algorithm'' is also used to define the notion of [[decidability (logic)|decidability]]. That notion is central for explaining how [[formal system]]s come into being starting from a small set of [[axiom]]s and rules. In [[logic]], the time that an algorithm requires to complete cannot be measured, as it is not apparently related to our customary physical dimension. From such uncertainties, which characterize ongoing work, stems the unavailability of a definition of ''algorithm'' that suits both concrete (in some sense) and abstract usage of the term.
:''For a detailed presentation of the various points of view around the definition of "algorithm" see [[Algorithm characterizations]]. For examples of simple addition algorithms specified in the detailed manner described in [[Algorithm characterizations]], see [[Algorithm examples]].''
== Formalization of algorithms ==
Algorithms are essential to the way [[computer]]s process information, because a [[computer program]] is essentially an algorithm that tells the computer what specific steps to perform (in what specific order) in order to carry out a specified task, such as calculating employees’ paychecks or printing students’ report cards. Thus, an algorithm can be considered to be any sequence of operations that can be performed by a [[Turing completeness|Turing-complete]] system. Authors who assert this thesis include Savage (1987) and Gurevich (2000):
...Turing's informal argument in favor of his thesis justifies a stronger thesis: every algorithm can be simulated by a Turing machine (Gurevich 2000:1)...according to Savage [1987], an algorithm is a computational process defined by a Turing machine. (Gurevich 2000:3)
Typically, when an algorithm is associated with processing information, data are read from an input source or device, written to an output sink or device, and/or stored for further processing. Stored data are regarded as part of the internal state of the entity performing the algorithm. In practice, the state is stored in a [[data structure]], but an algorithm requires the internal data only for specific operation sets called [[abstract data type]]s.
For any such computational process, the algorithm must be rigorously defined: specified in the way it applies in all possible circumstances that could arise. That is, any conditional steps must be systematically dealt with, case-by-case; the criteria for each case must be clear (and computable).
Because an algorithm is a precise list of precise steps, the order of computation will almost always be critical to the functioning of the algorithm. Instructions are usually assumed to be listed explicitly, and are described as starting "from the top" and going "down to the bottom", an idea that is described more formally by ''[[control flow|flow of control]]''.
So far, this discussion of the formalization of an algorithm has assumed the premises of [[imperative programming]]. This is the most common conception, and it attempts to describe a task in discrete, "mechanical" means. Unique to this conception of formalized algorithms is the [[assignment operation]], setting the value of a variable. It derives from the intuition of "[[memory]]" as a scratchpad. There is an example below of such an assignment.
For some alternative conceptions of what constitutes an algorithm, see [[functional programming]] and [[logic programming]].
=== Termination ===
Some writers restrict the definition of ''algorithm'' to procedures that eventually finish. In such a category Kleene places the "''decision procedure'' or ''decision method'' or ''algorithm'' for the question" (Kleene 1952:136). Others, including Kleene, include procedures that could run forever without stopping; such a procedure has been called a "computational method" (Knuth 1997:5) or "''calculation procedure'' or ''algorithm''" (Kleene 1952:137); however, Kleene notes that such a method must eventually exhibit "some object" (Kleene 1952:137).
Minsky makes the pertinent observation, in regard to determining whether an algorithm will eventually terminate (from a particular starting state):
But if the length of the process is not known in advance, then "trying" it may not be decisive, because if the process does go on forever — then at no time will we ever be sure of the answer (Minsky 1967:105).
As it happens, no other method can do any better, as was shown by [[Alan Turing]] with his celebrated result on the undecidability of the so-called [[halting problem]]. There is no algorithmic procedure for determining of arbitrary algorithms whether or not they terminate from given starting states. The analysis of algorithms for their likelihood of termination is called [[termination analysis]].
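Minsky's point about "trying" a process can be illustrated with a small Python sketch. The helper <code>run_with_budget</code> and its parameters are hypothetical names chosen for this example. It runs a step function for at most a fixed number of steps and reports either that the process halted or that the outcome is still unknown; it does not, and cannot, decide termination in general.

```python
from typing import Callable, Tuple


def run_with_budget(step: Callable[[int], int],
                    is_final: Callable[[int], bool],
                    state: int,
                    max_steps: int) -> Tuple[str, int]:
    """Run `step` from `state` until `is_final` holds or the budget is spent."""
    steps = 0
    while not is_final(state):
        if steps == max_steps:
            # Budget exhausted: the process may halt later or never.
            return ("unknown", steps)
        state = step(state)
        steps += 1
    return ("halted", steps)


def collatz_step(n: int) -> int:
    """One step of the Collatz process: halve even numbers, else 3n + 1."""
    return n // 2 if n % 2 == 0 else 3 * n + 1


print(run_with_budget(collatz_step, lambda n: n == 1, 6, 100))   # ('halted', 8)
print(run_with_budget(collatz_step, lambda n: n == 1, 6, 3))     # ('unknown', 3)
print(run_with_budget(lambda n: n + 1, lambda n: n == 0, 1, 1000))  # ('unknown', 1000)
```

The second and third calls return the same verdict, although the first of them would halt with a larger budget and the second never halts. A bounded trial cannot tell these two situations apart.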
See the examples of (im-)"proper" subtraction at [[partial function]] for more about what can happen when an algorithm fails for certain of its input numbers — e.g., (i) non-termination, (ii) production of "junk" (output in the wrong format to be considered a number) or no number(s) at all (halt ends the computation with no output), (iii) wrong number(s), or (iv) a combination of these. Kleene proposed that the production of "junk" or failure to produce a number is solved by having the algorithm detect these instances and produce e.g., an error message (he suggested "0"), or preferably, force the algorithm into an endless loop (Kleene 1952:322). Davis does this to his subtraction algorithm — he fixes his algorithm in a second example so that it is proper subtraction (Davis 1958:12-15). Along with the logical outcomes "true" and "false" Kleene also proposes the use of a third logical symbol "u" — undecided (Kleene 1952:326) — thus an algorithm will always produce ''something'' when confronted with a "proposition". The problem of wrong answers must be solved with an independent "proof" of the algorithm e.g., using induction:
We normally require auxiliary evidence for this (that the algorithm correctly defines a [[mu recursive function]]), e.g., in the form of an inductive proof that, for each argument value, the computation terminates with a unique value (Minsky 1967:186).
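The difference between a partial subtraction and a "proper" subtraction can be sketched as follows. This is a minimal illustration on non-negative integers, not a transcription of Davis's or Kleene's constructions; the function names are invented for the example, and Python's <code>None</code> stands in for the "undecided" outcome.

```python
from typing import Optional


def partial_subtract(m: int, n: int) -> Optional[int]:
    """Return m - n when it is a natural number, else None ("undecided")."""
    if m < 0 or n < 0:
        raise ValueError("inputs must be non-negative integers")
    while n > 0:
        if m == 0:
            # Nothing left to take away: the result is not a natural number.
            return None
        m -= 1
        n -= 1
    return m


def proper_subtract(m: int, n: int) -> int:
    """Return m - n when m >= n, and 0 otherwise (0 used as the error value)."""
    result = partial_subtract(m, n)
    return 0 if result is None else result


print(partial_subtract(5, 3), partial_subtract(3, 5))  # 2 None
print(proper_subtract(5, 3), proper_subtract(3, 5))    # 2 0
```

Both functions terminate on every valid input, so the failure modes that remain are a missing result (handled here by an explicit marker or by the value 0) and wrong answers, which still call for an independent correctness argument.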
BI is sometimes used interchangeably with briefing books, report and query tools and executive information systems. In general, business intelligence systems are data-driven DSS.
BI systems provide historical, current, and predictive views of business operations, most often using data that has been gathered into a [[data warehouse]] or a [[data mart]] and occasionally working from operational data. Software elements support the use of this information by assisting in the extraction, analysis, and reporting of information. Applications tackle sales, production, financial, and many other sources of business data for purposes that include, notably, [[business performance management]]. Information may be gathered on comparable companies to produce [[benchmarking|benchmarks]].
==History==
Prior to the start of the [[Information Age]] in the late 20th century, businesses had to collect data from non-automated sources. Businesses then lacked the computing resources necessary to properly analyze the data, and as a result, companies often made business decisions primarily on the basis of [[intuition (knowledge)|intuition]].
As businesses automated their systems, the amount of data increased, but its collection remained difficult because information could not easily be moved between or within systems. Analysis of information informed long-term decision making, but it was slow, and short-term decisions often still relied on instinct or expertise. Business intelligence was defined in 1958 by [[Hans Peter Luhn]], who wrote,
In this paper, business is a collection of activities carried on for whatever purpose, be it science, technology, commerce, industry, law, government, defense, et cetera. The communication facility serving the conduct of a business (in the broad sense) may be referred to as an intelligence system. The notion of intelligence is also defined here, in a more general sense, as "the ability to apprehend the interrelationships of presented facts in such a way as to guide action towards a desired goal."
In 1989 Howard Dresner, later a [[Gartner Group]] analyst, popularized BI as an umbrella term to describe "concepts and methods to improve business decision making by using fact-based support systems." In modern businesses the use of standards, automation and specialized software, including [[Online analytical processing|analytical tools]], allows large volumes of data to be [[Extract, transform, load|extracted, transformed, loaded]] and [[Data warehouse|warehoused]] to greatly increase the speed at which information becomes available for decision-making.
===Key intelligence topics===
Business intelligence often uses [[key performance indicators]] (KPIs) to assess the present state of business and to prescribe a course of action. Examples of KPIs are things such as lead conversion rate (in sales) and inventory turnover (in inventory management). Prior to the widespread adoption of computer and web applications, when information had to be manually input and calculated, performance data was often not available for weeks or months. Recently, banks have tried to make data available at shorter intervals and have reduced delays. The KPI methodology was further expanded with the Chief Performance Officer methodology which incorporated KPIs and root cause analysis into a single methodology.
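The two example KPIs can be computed from a handful of figures. The sketch below assumes commonly used definitions that are not stated in this article: lead conversion rate as converted leads divided by total leads, and inventory turnover as cost of goods sold divided by average inventory value. The function names and the sample numbers are illustrative only.

```python
def lead_conversion_rate(converted_leads: int, total_leads: int) -> float:
    """Fraction of leads that became customers (assumed definition)."""
    if total_leads <= 0:
        raise ValueError("total_leads must be positive")
    return converted_leads / total_leads


def inventory_turnover(cost_of_goods_sold: float, average_inventory: float) -> float:
    """Times the inventory was sold and replaced in a period (assumed definition)."""
    if average_inventory <= 0:
        raise ValueError("average_inventory must be positive")
    return cost_of_goods_sold / average_inventory


# Made-up sample figures for one reporting period.
print(lead_conversion_rate(25, 200))            # 0.125
print(inventory_turnover(600000.0, 150000.0))   # 4.0
```

Once the inputs are available in a database, such figures can be recomputed as often as the underlying data is refreshed, which is what makes weekly or daily KPI reporting feasible.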
Businesses that face higher operational/[[credit risk]] loading, such as [[credit card]] companies and "wealth management" services, often make KPI-related data available weekly. In some cases, companies may even offer a daily analysis of data. This fast pace requires analysts to use [[information technology|IT]] [[system]]s to process this large volume of data.
Chatterbot
A '''chatterbot''' (or chatbot) is a type of conversational agent, a [[computer program]] designed to simulate an intelligent [[conversation]] with one or more human users via auditory or textual methods. In other words, a chatterbot is a computer program with artificial intelligence to talk to people through voices or typed words. Though many appear to be intelligently interpreting the human input prior to providing a response, most chatterbots simply scan for keywords within the input and pull a reply with the most matching keywords or the most similar wording pattern from a local [[database]]. Chatterbots may also be referred to as ''talk bots'', ''chat bots'', or ''chatterboxes''.
== Method of operation ==
A good understanding of a conversation is required to carry on a meaningful dialog but most chatterbots do not attempt this. Instead they "converse" by recognizing cue words or phrases from the human user, which allows them to use pre-prepared or pre-calculated responses which can move the conversation on in an apparently meaningful way without requiring them to know what they are talking about.
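The cue-word approach can be sketched as a lookup that picks the canned reply whose keyword set overlaps most with the user's input. Everything in the sketch is hypothetical (the keyword sets, the replies and the function names); it is a toy illustration of the idea, not the design of any particular chatterbot.

```python
import re

# A tiny stand-in for the chatterbot's local database of prepared replies.
RESPONSES = [
    ({"hello", "hi", "hey"}, "Hello! How are you today?"),
    ({"weather", "rain", "sunny"}, "I hear the weather has been changeable."),
    ({"music", "song", "band"}, "What kind of music do you like?"),
]
DEFAULT_REPLY = "Tell me more."


def tokenize(text: str) -> set:
    """Lower-case the input and split it into a set of words."""
    return set(re.findall(r"[a-z']+", text.lower()))


def reply(text: str) -> str:
    """Return the prepared reply that shares the most keywords with the input."""
    words = tokenize(text)
    best_score, best_reply = 0, DEFAULT_REPLY
    for keywords, canned in RESPONSES:
        score = len(words & keywords)
        if score > best_score:
            best_score, best_reply = score, canned
    return best_reply


print(reply("Hi, do you like the rain or sunny weather?"))
```

The program never models what the words mean. It only counts overlaps, which is why the conversation can appear meaningful without any understanding behind it.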
For example, if a human types, "I am feeling very worried lately," the chatterbot may be programmed to recognize the phrase "I am" and respond by replacing it with "Why are you" plus a question mark at the end, giving the answer, "Why are you feeling very worried lately?" A similar approach using keywords would be for the program to answer any comment including ''(Name of celebrity)'' with "I think they're great, don't you?" Humans, especially those unfamiliar with chatterbots, sometimes find the resulting conversations engaging. Critics of chatterbots call this engagement the [[ELIZA effect]].
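Both rules from this paragraph fit in a short Python sketch. The pattern, the placeholder celebrity name and the fallback reply are assumptions made for the example; historical programs such as ELIZA used considerably richer scripts.

```python
import re

# Matches inputs that start with "I am", capturing the rest of the sentence
# without trailing punctuation.
I_AM = re.compile(r"^\s*I am\s+(.+?)[.,!?]*\s*$", re.IGNORECASE)

# Placeholder list; a real program would hold actual names here.
CELEBRITIES = ("Famous Person",)


def respond(text: str) -> str:
    """Apply the "I am" rewrite rule, then the celebrity keyword rule."""
    match = I_AM.match(text)
    if match:
        return f"Why are you {match.group(1)}?"
    for name in CELEBRITIES:
        if name.lower() in text.lower():
            return "I think they're great, don't you?"
    return "Please go on."


print(respond("I am feeling very worried lately."))
# Why are you feeling very worried lately?
```

The reply reuses the user's own words, so it seems attentive even though the program has only substituted one phrase for another.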
Some programs classified as chatterbots use other principles. One example is [[Jabberwacky]], which attempts to model the way humans learn new facts and language. [[Ellaz Systems|ELLA]] attempts to use [[natural language processing]] to make more useful responses from a human's input. Some programs that use natural language conversation, such as [[SHRDLU]], are not generally classified as chatterbots because they link their speech ability to knowledge of a simulated world. This type of link requires a more complex [[artificial intelligence]] (e.g., a "vision" system) than standard chatterbots have.
== Early chatterbots ==
The classic early chatterbots are [[ELIZA]] and [[PARRY]]. More recent programs are [[Racter]], [[Verbot]]s, [[Artificial Linguistic Internet Computer Entity|A.L.I.C.E.]], and [[Ellaz Systems|ELLA]].
The growth of chatterbots as a research field has created an expansion in their purposes. While ELIZA and PARRY were used exclusively to simulate typed conversation, [[Racter]] was used to "write" a story called ''The Policeman's Beard is Half Constructed''. ELLA includes a collection of games and functional features to further extend the potential of chatterbots.
The term "ChatterBot" was coined in 1994 by [[Michael Loren Mauldin|Michael Mauldin]], creator of the first [[Verbot]], Julia, to describe these conversational programs.
== Malicious chatterbots ==
Malicious chatterbots are frequently used to fill chat rooms with spam and advertising, or to entice people into revealing personal information, such as bank account numbers. They are commonly found on [[Yahoo! Messenger]], [[Windows Live Messenger]], [[AOL Instant Messenger]] and other [[instant messaging]] protocols. There has been a published report of a chatterbot used in a fake personal ad on a dating service's website.
==Chatterbots in modern AI==
Most modern AI research focuses on practical engineering tasks. This is known as weak AI and is distinguished from [[strong AI]], which would require [[sapience]] and reasoning abilities.
One pertinent field of AI research is natural language. Usually weak AI fields employ specialised software or programming languages created for them. For example, one of the 'most-human' natural language chatterbots, [[Artificial Linguistic Internet Computer Entity|A.L.I.C.E.]], uses a programming language called AIML that is specific to its program, and its various clones, named Alicebots. Nevertheless, A.L.I.C.E. is still based on pattern matching without any reasoning. This is the same technique [[ELIZA]], the first chatterbot, was using back in 1966.
Australian company MyCyberTwin also deals in strong AI, allowing users to create and sustain their own virtual personalities online. MyCyberTwin.com also works in a corporate setting, allowing companies to set up Virtual AI Assistants. Another notable program, known as [[Jabberwacky]], also deals in strong AI, as it is claimed to learn new responses based on user interactions, rather than being driven from a static database like many other existing chatterbots. Although such programs show initial promise, many of the existing results in trying to tackle the problem of natural language still appear fairly poor, and it seems reasonable to state that there is currently no general purpose conversational artificial intelligence. This has led some software developers to focus more on the practical aspect of chatterbot technology - information retrieval.
A common rebuttal often used within the AI community against criticism of such approaches asks, "How do we know that humans don't also just follow some cleverly devised rules?" (in the way that Chatterbots do). Two famous examples of this line of argument against the rationale for the basis of the Turing test are John Searle's [[Chinese room]] argument and Ned Block's [[Intentional stance|Blockhead argument]].
==Chatterbots/Virtual Assistants in Commercial Environments==
Automated Conversational Systems have progressed and evolved far from the original designs of the first widely used chatbots. In the UK, large commercial entities such as Lloyds TSB, Royal Bank of Scotland, Renault, Citroën and One Railway are already utilizing Virtual Assistants to reduce expenditures on Call Centres and provide a first point of contact that can inform the user exactly of points of interest, provide support, capture data from the user and promote products for sale.
In the UK, new projects and research are being conducted to introduce a Virtual Assistant into the classroom to assist the teacher. This project is the first of its kind and the chatbot VA in question is based on the Yhaken [http://www.elzware.com] chatbot design.
The Yhaken template provides a further move forward in Automated Conversational Systems with features such as complex conversational routing and responses, well defined personality, a complex hierarchical construct with additional external reference points, emotional responses and in depth small talk, all to make the experience more interactive and involving for the user.
==Annual contests for chatterbots==
Many organizations try to encourage and support developers all over the world to build chatterbots that can perform a variety of tasks and compete with each other through [[Turing test]]s and other contests. Annual contests are organized at the following links:
*[http://www.chatterboxchallenge.com The Chatterbox Challenge]
*[http://www.loebner.net/Prizef/loebner-prize.html The Loebner Prize]
Computational linguistics
'''Computational linguistics''' is an [[interdisciplinary]] field dealing with the [[Statistics|statistical]] and/or rule-based modeling of [[natural language]] from a computational perspective. This modeling is not limited to any particular field of [[linguistics]]. Traditionally, computational linguistics was usually performed by [[computer scientist]]s who had specialized in the application of computers to the processing of a [[natural language]]. Recent research has shown that human language is much more complex than previously thought, so computational linguists often work as members of interdisciplinary teams, including linguists (specifically trained in linguistics), language experts (persons with some level of ability in the languages relevant to a given project), and computer scientists. In general computational linguistics draws upon the involvement of linguists, [[computer science|computer scientists]], experts in [[artificial intelligence]], [[cognitive psychology|cognitive psychologists]], [[math]]ematicians, and [[logic]]ians, amongst others.
==Origins==
Computational linguistics as a field predates [[artificial intelligence]], a field under which it is often grouped. Computational linguistics originated with efforts in the [[United States]] in the 1950s to use computers to automatically translate texts from foreign languages, particularly [[Russian language|Russian]] scientific journals, into English. Since computers had proven their ability to do [[arithmetic]] much faster and more accurately than humans, it was thought to be only a short matter of time before the technical details could be taken care of that would allow them the same remarkable capacity to process language.
When [[machine translation]] (also known as mechanical translation) failed to yield accurate translations right away, automated processing of human languages was recognized as far more complex than had originally been assumed. Computational linguistics was born as the name of the new field of study devoted to developing [[algorithm]]s and [[software]] for intelligently processing language data. When artificial intelligence came into existence in the 1960s, the field of computational linguistics became that sub-division of artificial intelligence dealing with human-level comprehension and production of natural languages.
In order to translate one language into another, it was observed that one had to understand the [[grammar]] of both languages, including both [[morphology (linguistics)|morphology]] (the grammar of word forms) and [[syntax]] (the grammar of sentence structure). In order to understand syntax, one had to also understand the [[semantics]] and the [[lexicon]] (or 'vocabulary'), and even to understand something of the [[pragmatics]] of language use. Thus, what started as an effort to translate between languages evolved into an entire discipline devoted to understanding how to represent and process natural languages using computers.
==Subfields==
Computational linguistics can be divided into major areas depending upon the medium of the language being processed, whether spoken or textual; and upon the task being performed, whether analyzing language (recognition) or synthesizing language (generation). [[Speech recognition]] and [[speech synthesis]] deal with how spoken language can be understood or created using computers. Parsing and generation are sub-divisions of computational linguistics dealing respectively with taking language apart and putting it together. Machine translation remains the sub-division of computational linguistics dealing with having computers translate between languages.
Some of the areas of research that are studied by computational linguistics include:
*Computer aided [[corpus linguistics]]
*Design of [[parser]]s or [[phrase chunking|chunkers]] for [[natural language]]s
*Design of taggers like [[Part-of-speech tagging|POS-taggers (part-of-speech taggers)]]
*Definition of specialized logics like resource logics for [[Natural language processing|NLP]]
*Research in the relation between formal and natural languages in general
*[[Machine translation]], e.g. by a translating computer
*[[Computational complexity]] of natural language, largely modeled on [[automata theory]], with the application of [[context-sensitive grammar]] and [[Linear bounded automaton|linearly-bounded]] [[Turing machine]]s.
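As a rough illustration of what a tagger in the list above produces, the following Python sketch assigns part-of-speech labels by dictionary lookup. The lexicon, the tag names and the function are invented for this example; practical taggers rely on statistical or rule-based models that take context into account.

```python
# Toy lexicon mapping words to part-of-speech labels (illustrative only).
LEXICON = {
    "the": "DET",
    "a": "DET",
    "dog": "NOUN",
    "cat": "NOUN",
    "sees": "VERB",
    "barks": "VERB",
    "loudly": "ADV",
}


def tag(sentence: str) -> list:
    """Pair each whitespace-separated word with a label, or "UNK" if unknown."""
    return [(word, LEXICON.get(word.lower(), "UNK")) for word in sentence.split()]


print(tag("The dog sees a cat"))
```

A lookup of this kind cannot resolve words that belong to several categories, which is one reason the problem is studied with more elaborate models.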
The [[Association for Computational Linguistics]] defines computational linguistics as:
:...the scientific study of [[language]] from a computational perspective. Computational linguists are interested in providing [[computational model]]s of various kinds of linguistic phenomena.
Computer program
'''Computer programs''' (also '''[[Computer software|software programs]]''', or just '''programs''') are [[Instruction (computer science)|instructions]] for a [[computer]]. A computer requires programs to function, and a computer program does nothing unless its instructions are executed by a [[Central processing unit|central processor]]. Computer programs are usually [[executable]] programs or the [[source code]] from which executable programs are derived (e.g., [[compiler|compiled]]).
Computer source code is often written by professional [[computer programmer]]s. Source code is written in a [[programming language]] that usually follows one of two main [[Programming paradigm|paradigms]]: [[imperative programming|imperative]] or [[declarative language|declarative]] programming. Source code may be converted into an [[executable file]] (sometimes called an executable program) by a [[compiler]]. Alternatively, computer programs may be executed by a [[central processing unit]] with the aid of an [[Interpreter (computing)|interpreter]], or may be [[firmware|embedded]] directly into [[Computer hardware|hardware]].
Computer programs may be categorized along functional lines: [[system software]] and [[application software]]. Many computer programs may run simultaneously on a single computer, a process known as [[computer multitasking|multitasking]].
==Programming==
 #include <stdio.h>
 
 int main(void)
 {
     puts("Hello world!");  /* write the greeting to standard output */
     return 0;
 }
Source code of a program written in the [[C programming language]]
[[Computer programming]] is the iterative process of writing or editing [[source code]]. Editing source code involves testing, analyzing, and refining. A person who practices this skill is referred to as a computer [[programmer]] or software developer. The sometimes lengthy process of computer programming is usually referred to as [[software development]]. The term [[software engineering]] is becoming popular as the process is seen as an [[engineering]] discipline.
=== Paradigms ===
Computer programs can be categorized by the [[programming language]] [[Programming paradigm|paradigm]] used to produce them. Two of the main paradigms are [[imperative programming|imperative]] and [[declarative language|declarative]].
Programs written using an imperative language specify an [[algorithm]] using declarations, expressions, and statements. A declaration associates a [[variable]] name with a [[datatype]], for example <code>var x: integer;</code>. An expression yields a value, for example <code>2 + 2</code> yields 4. Finally, a statement might assign an expression to a variable or use the value of a variable to alter the program's control flow, for example <code>x := 2 + 2; if x = 4 then do_something();</code>. One criticism of imperative languages is the side effect of an assignment statement on a class of variables called non-local variables.
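The same three ingredients, together with the side effect on a non-local variable, can be shown in a short Python sketch. Python is used here only for illustration, and the names <code>counter</code> and <code>next_id</code> are made up for the example.

```python
counter = 0  # a non-local (module-level) variable


def next_id() -> int:
    """Return a fresh number; the assignment changes state outside the function."""
    global counter
    counter = counter + 1  # side effect on the non-local variable
    return counter


x: int          # declaration: associates the name x with a type
x = 2 + 2       # statement assigning the value of an expression
if x == 4:      # statement using a variable to alter control flow
    first = next_id()
    second = next_id()

print(x, first, second, counter)  # 4 1 2 2
```

Two calls to <code>next_id()</code> with identical arguments return different values, because each call silently changes <code>counter</code>. This is the kind of hidden dependency that the criticism refers to.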
Programs written using a declarative language specify the properties that have to be met by the output and do not specify any implementation details. Two broad categories of declarative languages are [[functional language]]s and [[logical language]]s. The principle behind functional languages (like [[Haskell (programming language)|Haskell]]) is to not allow side-effects, which makes it easier to reason about programs like mathematical functions. The principle behind logical languages (like [[Prolog]]) is to define the problem to be solved — the goal — and leave the detailed solution to the Prolog system itself. The goal is defined by providing a list of subgoals. Then each subgoal is defined by further providing a list of its subgoals, etc. If a path of subgoals fails to find a solution, then that subgoal is [[Backtracking|backtracked]] and another path is systematically attempted.
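The goal and subgoal search with backtracking can be imitated in Python. The sketch below is a deliberately reduced model: goals are plain strings without variables or unification, there is no protection against circular rules, and the rule base is invented for the example. It is not how a Prolog system is implemented.

```python
# Each goal maps to a list of alternative subgoal lists (invented rule base).
RULES = {
    "travel_ready": [["has_ticket", "has_passport"],
                     ["has_ticket", "has_id_card"]],
    "has_ticket": [["bought_online"], ["bought_at_station"]],
}
# Goals that hold without further proof.
FACTS = {"bought_at_station", "has_id_card"}


def solve(goal: str) -> bool:
    """Return True if the goal is a fact or some alternative's subgoals all hold."""
    if goal in FACTS:
        return True
    for subgoals in RULES.get(goal, []):
        if all(solve(sub) for sub in subgoals):
            return True
        # This path failed: fall through and try the next alternative.
    return False


print(solve("travel_ready"))  # True
print(solve("has_passport"))  # False
```

The first alternative for <code>travel_ready</code> fails on <code>has_passport</code>, so the search abandons it and succeeds through the second alternative. The caller states only what must hold, not the order in which the alternatives should be explored.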
The form in which a program is created may be textual or visual. In a [[visual language]] program, elements are graphically manipulated rather than textually specified.
===Compilation or interpretation===
A ''computer program'' in the form of a [[human-readable]], computer programming language is called [[source code]]. Source code may be converted into an [[executable file|executable image]] by a [[compiler]] or executed immediately with the aid of an [[Interpreter (computing)|interpreter]].
Compiled computer programs are commonly referred to as executables, binary images, or simply as [[binary file|binaries]] — a reference to the [[binary numeral system|binary]] [[file format]] used to store the executable code. Compilers are used to translate source code from a programming language into either [[object file|object code]] or [[machine code]]. Object code needs further processing to become machine code, and machine code is the [[Central processing unit|Central Processing Unit]]'s native [[microcode|code]], ready for execution.
Interpreted computer programs are either decoded and then immediately executed or are decoded into some efficient intermediate representation for future execution. [[BASIC]], [[Perl]], and [[Python (programming language)|Python]] are examples of immediately executed computer programs. Alternatively, [[Java (programming language)|Java]] computer programs are compiled ahead of time and stored as a machine independent code called [[bytecode]]. Bytecode is then executed upon request by an interpreter called a [[virtual machine]].
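The two-stage route of compiling to bytecode and then executing it can be observed from within Python itself: CPython, the reference implementation, translates source text into a bytecode object before its virtual machine runs it. The sketch uses the built-in <code>compile</code> and <code>exec</code> functions and the standard <code>dis</code> module; the source string and variable names are arbitrary.

```python
import dis

source = "result = 2 + 2"

# Stage 1: translate the source text into a code object holding bytecode.
code = compile(source, "<sketch>", "exec")

# Stage 2: let the virtual machine execute the bytecode in a fresh namespace.
namespace = {}
exec(code, namespace)
print(namespace["result"])

# Show the bytecode instructions; the listing differs between Python versions.
dis.dis(code)
```

The code object can be executed repeatedly without parsing the source again, which is the benefit of keeping an intermediate representation.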
The main disadvantage of interpreters is that computer programs run slower than if compiled. Interpreting code is slower than running the compiled version because the interpreter must [[decode]] each [[Statement (programming)|statement]] each time it is loaded and then perform the desired action. On the other hand, software development may be quicker using an interpreter because testing is immediate when the compilation step is omitted. Another disadvantage of interpreters is that the interpreter must be present on the computer at the time the computer program is executed. By contrast, compiled computer programs do not need the compiler to be present at the time of execution.
No properties of a programming language require it to be exclusively compiled or exclusively interpreted. The categorization usually reflects the most popular method of language execution. For example, BASIC is thought of as an interpreted language and C a compiled language, despite the existence of BASIC compilers and C interpreters.
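The blurred line between compiling and interpreting can be illustrated with a minimal sketch in Python. It assumes the CPython implementation, which first translates source text into bytecode and then runs that bytecode on its virtual machine; the variable names are illustrative only.

```python
import dis

source = "total = sum(n * n for n in range(4))"

# Translate step: the source text becomes a code object holding bytecode.
code_object = compile(source, "<example>", "exec")

# Execute step: the virtual machine runs the bytecode in a fresh namespace.
namespace = {}
exec(code_object, namespace)
print(namespace["total"])  # 0 + 1 + 4 + 9 = 14

# Optional: list the bytecode instructions (the listing varies by version).
dis.dis(code_object)
```

The same language is thus both "compiled" (to bytecode) and "interpreted" (by the virtual machine), depending on which step is being described.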
===Self-modifying programs===
A computer program in [[execution (computers)|execution]] is normally treated as being different from the [[data (computing)|data]] the program operates on. However, in some cases this distinction is blurred when a computer program modifies itself. The modified computer program is subsequently executed as part of the same program. [[Self-modifying code]] is possible for programs written in [[Lisp programming language|Lisp]], [[cobol|COBOL]], and [[Prolog]].
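As a loose illustration, the following Python sketch treats program text as data and replaces a function definition while the program is running. It is not self-modification at the machine-code level, and the function name <code>greet</code> is hypothetical.

```python
# The running program keeps its definitions in an ordinary dictionary.
program = {}

# Original definition, supplied as a string of source text.
exec("def greet():\n    return 'hello'\n", program)
before = program["greet"]()

# The program builds new source text at run time and redefines the function.
exec("def greet():\n    return 'hello, modified'\n", program)
after = program["greet"]()

print(before, "->", after)
```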
==Execution and storage==
Typically, computer programs are stored in [[non-volatile memory]] until requested either directly or indirectly to be [[execution (computers)|executed]] by the computer user. Upon such a request, the program is loaded into [[random access memory]] by a computer program called an [[operating system]], where it can be accessed directly by the central processor. The central processor then executes ("runs") the program, instruction by instruction, until termination. A program in execution is called a [[Process (computing)|process]]. Termination is either by normal self-termination or by error, whether a software or a hardware error.
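The life cycle described above can be observed from within a program. The sketch below, which assumes a standard Python installation, asks the operating system to start a child process and then inspects how it terminated; the exit status 3 is an arbitrary value chosen for the example.

```python
import subprocess
import sys

# Source text of a small child program that terminates itself with status 3.
child_source = "import sys; print('child running'); sys.exit(3)"

# The operating system loads a new Python process and runs it to termination.
result = subprocess.run(
    [sys.executable, "-c", child_source],
    capture_output=True,
    text=True,
)

print(result.stdout.strip())   # text written by the child process
print(result.returncode)       # exit status reported at termination
```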
===Embedded programs===
Some computer programs are embedded into hardware. A [[stored-program computer]] requires an initial computer program stored in its [[read-only memory]] to [[booting|boot]]. The boot process is to identify and initialize all aspects of the system, from [[Processor register|CPU registers]] to [[Device driver|device controllers]] to [[Volatile memory|memory]] contents. Following the initialization process, this initial computer program loads the [[operating system]] and sets the [[program counter]] to begin normal operations. Independent of the host computer, a [[Peripheral|hardware device]] might have embedded [[firmware]] to control its operation. Firmware is used when the computer program is rarely or never expected to change, or when the program must not be lost when the power is off.
===Manual programming===
Computer programs historically were manually input to the central processor via switches. An instruction was represented by a configuration of on/off settings. After setting the configuration, an execute button was pressed. This process was then repeated. Computer programs also historically were manually input via [[paper tape]] or [[punched cards]]. After the medium was loaded, the starting address was set via switches and the execute button pressed.
===Automatic program generation===
[[Generative programming]] is a style of [[computer programming]] that creates [[source code]] through [[generic programming|generic]] [[class (computer science)|classes]], [[Prototype-based programming|prototypes]], [[template (programming)|template]]s, [[aspect (computer science)|aspect]]s, and [[Code generation (compiler)|code generator]]s to improve [[programmer]] productivity. Source code is generated with [[programming tool]]s such as a [[template processor]] or an [[Integrated development environment|Integrated Development Environment]]. The simplest form of source code generator is a [[Macro (computer science)|macro]] processor, such as the [[C preprocessor]], which replaces patterns in source code according to relatively simple rules.
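A template processor of the kind described above can be sketched with Python's standard <code>string.Template</code> class. The field names and the generated accessor functions are hypothetical; the point is only that source code is produced by substituting values into a pattern.

```python
from string import Template

# A pattern for one accessor function; ${name} is filled in per field.
template = Template(
    "def get_${name}(record):\n"
    "    return record['${name}']\n"
)

fields = ["title", "year"]
generated_source = "".join(template.substitute(name=f) for f in fields)

# The generated text is ordinary source code and can be executed.
namespace = {}
exec(generated_source, namespace)

record = {"title": "Example", "year": 1967}
print(namespace["get_title"](record), namespace["get_year"](record))
```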
[[Software engine]]s output source code or [[Markup language|markup code]] that simultaneously becomes the input to another [[Process (computing)|computer process]]. The analogy is that of one process driving another process, with the computer code being burned as fuel. [[Application server]]s are software engines that deliver applications to [[client computer]]s. For example, a [[Wiki software|Wiki]] is an application server that allows users to build [[dynamic web page|dynamic content]] assembled from [[article (publishing)|articles]]. Wikis generate [[HTML]], [[CSS]], [[Java (programming language)|Java]], and [[Javascript]], which are then [[Interpreter (computing)|interpreted]] by a [[web browser]].
===Simultaneous execution===
Many operating systems support [[computer multitasking|multitasking]] which enables many computer programs to appear to be running simultaneously on a single computer. Operating systems may run multiple programs through [[process scheduling]] — a software mechanism to [[Context switch|switch]] the CPU among processes frequently so that users can [[Time-sharing|interact]] with each program while it is running. Within hardware, modern day multiprocessor computers or computers with multicore processors may run multiple programs.
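The effect of switching a single processor among several programs can be imitated with a toy round-robin scheduler. This is a simplified sketch, not how a real operating system kernel is written: each "process" is a Python generator, and one <code>next()</code> call stands in for one time slice.

```python
from collections import deque

def task(name, steps):
    """A toy process that needs `steps` time slices to finish."""
    for i in range(1, steps + 1):
        yield f"{name}:{i}"

def round_robin(tasks):
    """Give each unfinished task one time slice in turn."""
    queue = deque(tasks)
    trace = []
    while queue:
        current = queue.popleft()
        try:
            trace.append(next(current))   # run one time slice
        except StopIteration:
            continue                      # task finished; drop it
        queue.append(current)             # otherwise, back of the queue
    return trace

trace = round_robin([task("A", 3), task("B", 1), task("C", 2)])
print(trace)
```

The interleaved trace shows every task making progress even though only one runs at any instant.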
== Functional categories ==
Computer programs may be categorized along functional lines. These functional categories are [[system software]] and [[application software]]. System software includes the [[operating system]] which couples the [[computer hardware|computer's hardware]] with the application software. The purpose of the operating system is to provide an environment in which application software executes in a convenient and efficient manner. In addition to the operating system, system software includes [[Utility software|utility programs]] that help manage and tune the computer. If a computer program is not system software then it is application software. Application software includes [[middleware]], which couples the system software with the [[user interface]]. Application software also includes utility programs that help users solve application problems, like the need for sorting.
Computer science
'''Computer science''' (or '''computing science''') is the study and the [[science]] of the theoretical foundations of [[information]] and [[computation]] and their implementation and application in [[computer|computer system]]s. Computer science has many sub-fields; some emphasize the computation of specific results (such as [[computer graphics]]), while others relate to properties of [[computational problem]]s (such as [[computational complexity theory]]). Still others focus on the challenges in implementing computations. For example, [[programming language theory]] studies approaches to describing computations, while [[computer programming]] applies specific [[programming language]]s to solve specific computational problems. A further subfield, [[human-computer interaction]], focuses on the challenges in making computers and computations useful, usable and universally accessible to [[humans|people]].
== History ==
The early foundations of what would become computer science predate the invention of the modern [[digital computer]]. Machines for calculating fixed numerical tasks, such as the [[abacus]], have existed since antiquity. [[Wilhelm Schickard]] built the first mechanical calculator in 1623. [[Charles Babbage]] designed a [[difference engine]] in [[Victorian era|Victorian]] times (between 1837 and 1901) helped by [[Ada Lovelace]]. Around 1900, the companies that later became the [[IBM]] corporation sold [[Key_punch|punch-card machines]]. However, all of these machines were constrained to perform a single task, or at best some subset of all possible tasks.
During the 1940s, as newer and more powerful computing machines were developed, the term ''computer'' came to refer to the machines rather than their human predecessors. As it became clear that computers could be used for more than just mathematical calculations, the field of computer science broadened to study [[computation]] in general. Computer science began to be established as a distinct academic discipline in the 1960s, with the creation of the first computer science departments and degree programs. Since practical computers became available, many applications of computing have become distinct areas of study in their own right.
Many initially believed it impossible that "computers themselves could actually be a scientific field of study" (Levy 1984, p. 11), though in the "late fifties" (Levy 1984, p. 11) the idea gradually became accepted among the greater academic population. The now well-known IBM brand formed part of the computer science revolution during this time. IBM (short for International Business Machines) released the IBM 704 and later the IBM 709 computers, which were widely used during the exploration period of such devices. "Still, working with the IBM [computer] was frustrating...if you had misplaced as much as one letter in one instruction, the program would crash, and you would have to start the whole process over again" (Levy 1984, p. 13). During the late 1950s, the computer science discipline was very much in its developmental stages, and such issues were commonplace.
Time has seen significant improvements in the usability and effectiveness of computer science technology. Modern society has seen a significant shift from computers being used solely by experts or professionals to a more widespread user base. By the 1990s, computers had become accepted as the norm within everyday life. During this time data entry was a primary component of the use of computers, with many businesses preferring to streamline their practices through the use of a computer. This also gave the additional benefit of removing the need for large amounts of documentation and file records, which consumed much-needed physical space within offices.
== Major achievements ==
Despite its relatively short history as a formal academic discipline, computer science has made a number of fundamental contributions to [[science]] and [[society]]. These include:
;Applications within computer science
* A formal definition of [[computation]] and [[computability]], and proof that there are computationally [[Undecidable problem|unsolvable]] and [[Intractable#Intractability|intractable]] problems.
* The concept of a [[programming language]], a tool for the precise expression of methodological information at various levels of abstraction.
;Applications outside of computing
* Sparked the [[Digital Revolution]] which led to the current [[Information Age]] and the [[Internet]].
* In [[cryptography]], [[Cryptanalysis of the Enigma|breaking the Enigma machine]] was an important factor contributing to the Allied victory in World War II.
* [[Scientific computing]] enabled advanced study of the mind, and made mapping the human genome possible through the [[Human Genome Project]]. [[Distributed computing]] projects like [[Folding@home]] explore [[protein folding]].
* [[Algorithmic trading]] has increased the [[Economic efficiency|efficiency]] and [[Market liquidity|liquidity]] of financial markets by using [[artificial intelligence]], [[machine learning]] and other [[statistics|statistical]] and [[Numerical analysis|numerical]] techniques on a large scale.
== Relationship with other fields ==
Despite its name, a significant amount of computer science does not involve the study of computers themselves. Because of this, several alternative names have been proposed. Danish scientist [[Peter Naur]] suggested the term ''datalogy'', to reflect the fact that the scientific discipline revolves around data and data treatment, while not necessarily involving computers. The first scientific institution to use the term was the Department of Datalogy at the University of Copenhagen, founded in 1969, with Peter Naur being the first professor in datalogy. The term is used mainly in the Scandinavian countries. Also, in the early days of computing, a number of terms for the practitioners of the field of computing were suggested in the ''Communications of the ACM'': ''turingineer'', ''turologist'', ''flow-charts-man'', ''applied meta-mathematician'', and ''applied epistemologist''. Three months later in the same journal, ''comptologist'' was suggested, followed the next year by ''hypologist''. Recently the term ''computics'' has been suggested. ''Informatik'' was a term used in Europe with more frequency.
The renowned computer scientist [[Edsger W. Dijkstra|Edsger Dijkstra]] stated, "Computer science is no more about computers than astronomy is about telescopes." The design and deployment of computers and computer systems is generally considered the province of disciplines other than computer science. For example, the study of [[computer hardware]] is usually considered part of [[computer engineering]], while the study of commercial [[computer system]]s and their deployment is often called [[information technology]] or [[information systems]]. Computer science is sometimes criticized as being insufficiently scientific, a view espoused in the statement "Science is to computer science as hydrodynamics is to plumbing", credited to [[Stan Kelly-Bootle]] and others. However, there has been much cross-fertilization of ideas between the various computer-related disciplines. Computer science research has also often crossed into other disciplines, such as [[cognitive science]], [[economics]], [[mathematics]], [[physics]] (see [[quantum computing]]), and [[linguistics]].
Computer science is considered by some to have a much closer relationship with [[mathematics]] than many scientific disciplines. Early computer science was strongly influenced by the work of mathematicians such as [[Kurt Gödel]] and [[Alan Turing]], and there continues to be a useful interchange of ideas between the two fields in areas such as [[mathematical logic]], [[category theory]], [[domain theory]], and [[algebra]].
The relationship between computer science and [[software engineering]] is a contentious issue, which is further muddied by [[Debates within software engineering|disputes]] over what the term "software engineering" means, and how computer science is defined. [[David Parnas]], taking a cue from the relationship between other engineering and science disciplines, has claimed that the principal focus of computer science is studying the properties of computation in general, while the principal focus of software engineering is the design of specific computations to achieve practical goals, making the two separate but complementary disciplines.
The academic, political, and funding aspects of computer science tend to depend on whether a department in the U.S. was formed with a mathematical emphasis or an engineering emphasis. In general, electrical engineering-based computer science departments have tended to succeed as computer science and/or engineering departments. Computer science departments with a mathematics emphasis and with a numerical orientation consider alignment with [[computational science]]. Both types of departments tend to make efforts to bridge the field educationally if not across all research.
== Fields of computer science ==
Computer science searches for concepts and [[formal proof]]s to explain and describe computational systems of interest. As with all sciences, these theories can then be utilised to synthesize practical engineering applications, which in turn may suggest new systems to be studied and analysed. While the [[ACM Computing Classification System]] can be used to split computer science up into different topics or fields, a more descriptive breakdown follows:
=== Mathematical foundations ===
; [[Mathematical logic]]
: Boolean logic and other ways of modeling logical queries; the uses and limitations of formal proof methods.
; [[Number theory]]
: Theory of proofs and heuristics for finding proofs in the simple domain of integers. Used in [[cryptography]] as well as a test domain in [[artificial intelligence]].
; [[Graph theory]]
: Foundations for data structures and searching algorithms.
; [[Type theory]]
: Formal analysis of the types of data, and the use of these types to understand properties of programs, especially program safety.
; [[Category theory]]
: Category theory provides a means of capturing all of math and computation in a single synthesis.
; [[Computational geometry]]
: The study of [[algorithm]]s to solve problems stated in terms of [[geometry]].
; [[Numerical analysis]]
: Foundations for algorithms in continuous mathematics, as well as the study of the limitations of floating point computation, including [[round-off]] errors.
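The round-off errors mentioned under numerical analysis can be seen in a short Python sketch. It assumes IEEE 754 double-precision floating point, which Python uses on common platforms: the decimal value 0.1 has no exact binary representation, so repeated addition drifts away from the mathematically exact sum.

```python
import math

values = [0.1] * 10

# Naive left-to-right accumulation: each addition rounds its result.
running_total = 0.0
for value in values:
    running_total += value

# math.fsum tracks the lost low-order bits and rounds only once at the end.
careful_total = math.fsum(values)

print(running_total == 1.0)   # False: accumulated round-off error
print(careful_total == 1.0)   # True
print(math.isclose(running_total, 1.0))  # True: compare with a tolerance
```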
=== Theory of computation ===
; [[Automata theory]]
: Different logical structures for solving problems.
; [[Computability theory (computer science)|Computability theory]]
: What is calculable with the current models of computers. Proofs developed by [[Alan Turing]] and others provide insight into the possibilities of what can be computed and what cannot.
; [[Computational complexity theory]]
: Fundamental bounds (especially time and storage space) on classes of computations; in practice, study of which problems a computer can solve with reasonable resources (while computability theory studies which problems can be solved at all).
; [[Quantum computing|Quantum computing theory]]
: Representation and manipulation of data using the quantum properties of particles and quantum mechanics.
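As a small example of the structures studied in automata theory, the following sketch simulates a deterministic finite automaton. The particular machine, which accepts binary strings containing an even number of 1s, and the state names are chosen purely for illustration.

```python
# Transition table: (current state, input symbol) -> next state.
TRANSITIONS = {
    ("even", "0"): "even",
    ("even", "1"): "odd",
    ("odd", "0"): "odd",
    ("odd", "1"): "even",
}

def accepts(word, start="even", accepting=frozenset({"even"})):
    """Return True if the automaton ends in an accepting state."""
    state = start
    for symbol in word:
        state = TRANSITIONS[(state, symbol)]
    return state in accepting

print(accepts("1001"))  # two 1s
print(accepts("10"))    # one 1
```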
=== Algorithms and data structures ===
; [[Analysis of algorithms]]
: Time and space complexity of algorithms.
; [[Algorithms]]
: Formal logical processes used for computation, and the efficiency of these processes.
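The difference in efficiency between two algorithms for the same task can be made concrete by counting steps. The sketch below compares linear search with binary search on a sorted list; the function names are illustrative, and the step counts stand in for running time.

```python
def linear_search_steps(items, target):
    """Count comparisons made by scanning from the front."""
    steps = 0
    for item in items:
        steps += 1
        if item == target:
            break
    return steps

def binary_search_steps(items, target):
    """Count probes made by repeatedly halving a sorted list."""
    low, high, steps = 0, len(items) - 1, 0
    while low <= high:
        steps += 1
        mid = (low + high) // 2
        if items[mid] == target:
            break
        if items[mid] < target:
            low = mid + 1
        else:
            high = mid - 1
    return steps

data = list(range(1024))
print(linear_search_steps(data, 1023))  # grows in proportion to n
print(binary_search_steps(data, 1023))  # grows in proportion to log2(n)
```

For a list of 1024 items, the linear scan may need up to 1024 comparisons, while the binary search needs at most 11 probes.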
=== Programming languages and compilers ===
; [[Compiler]]s
: Ways of translating computer programs, usually from [[high-level programming language|higher level]] languages to [[low-level programming language|lower level]] ones.
; [[Interpreter (computing)|Interpreter]]s
: A program that takes in as input a computer program and executes it.
; [[Programming language]]s
: Formal language paradigms for expressing algorithms, and the properties of these languages (e.g., what problems they are suited to solve).
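The idea of an interpreter as a program that takes another program as input and executes it can be sketched for a deliberately tiny language. The language here, postfix arithmetic over integers, is an assumption made for brevity and is far simpler than any practical programming language.

```python
import operator

# The operations the toy language understands.
OPERATIONS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def run(program):
    """Interpret a postfix program such as '2 3 + 4 *'."""
    stack = []
    for token in program.split():
        if token in OPERATIONS:
            right = stack.pop()
            left = stack.pop()
            stack.append(OPERATIONS[token](left, right))
        else:
            stack.append(int(token))   # operands are pushed as integers
    return stack.pop()

print(run("2 3 + 4 *"))  # (2 + 3) * 4
```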
=== Concurrent, parallel, and distributed systems ===
; [[Concurrency (computer science)|Concurrency]]
: The theory and practice of simultaneous computation; data safety in any multitasking or multithreaded environment.
; [[Distributed computing]]
: Computing using multiple computing devices over a network to accomplish a common objective or task and thereby reducing the latency involved in single processor contributions for any task.
; [[Parallel computing]]
: Computing using multiple concurrent threads of execution.
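The data-safety concern noted under concurrency can be sketched with Python's standard <code>threading</code> module. Several threads update a shared counter; the lock ensures that each read-modify-write step completes before another thread begins its own. The counter and the thread count are arbitrary choices for the example.

```python
import threading

state = {"count": 0}
lock = threading.Lock()

def add_many(times):
    """Increment the shared counter, holding the lock for each update."""
    for _ in range(times):
        with lock:
            state["count"] += 1

threads = [threading.Thread(target=add_many, args=(10_000,)) for _ in range(4)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(state["count"])  # 4 threads x 10,000 increments
```

Without the lock, two threads could read the same old value and one increment could be lost.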
=== Software engineering ===
; [[Algorithm design]]
: Using ideas from algorithm theory to creatively design solutions to real tasks.
; [[Computer programming]]
: The practice of using a programming language to implement algorithms.
; [[Formal methods]]
: Mathematical approaches for describing and reasoning about software designs.
; [[Reverse engineering]]
: The application of the scientific method to the understanding of arbitrary existing software.
; [[Software development]]
: The principles and practice of designing, developing, and testing programs, as well as proper engineering practices.
=== System architecture ===
; [[Computer architecture]]
: The design, organization, optimization and verification of a computer system, mostly about [[CPU]]s and [[memory (computers)|memory]] subsystems (and the bus connecting them).
; [[Computer organization]]
: The implementation of computer architectures, in terms of descriptions of their specific [[electrical circuit]]ry
; [[Operating system]]s
: Systems for managing computer programs and providing the basis of a useable system.
=== Communications ===
; [[Computer audio]]
: Algorithms and data structures for the creation, manipulation, storage, and transmission of [[digital audio]] recordings. Also important in [[voice recognition]] applications.
; [[Computer networking|Networking]]
: Algorithms and protocols for communicating data across different shared or dedicated media, often including [[error correction]].
; [[Cryptography]]
: Applies results from complexity, probability and number theory to invent and break codes.
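Error correction, mentioned under networking, can be illustrated with one of the simplest schemes: a triple-repetition code decoded by majority vote. This sketch is for illustration only; practical protocols use far more efficient codes.

```python
def encode(bits):
    """Send every bit three times."""
    return [bit for bit in bits for _ in range(3)]

def decode(received):
    """Recover each bit by majority vote over its three copies."""
    return [
        1 if sum(received[i:i + 3]) >= 2 else 0
        for i in range(0, len(received), 3)
    ]

message = [1, 0, 1]
sent = encode(message)

# Simulate noise on the channel by flipping one transmitted bit.
corrupted = sent.copy()
corrupted[1] ^= 1

print(decode(corrupted) == message)  # the single error is corrected
```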
=== Databases ===
; [[Data mining]]
: Data mining is the extraction of relevant data from all sources of data.
; [[Relational databases]]
: Study of algorithms for searching and processing information in documents and databases; closely related to [[information retrieval]].
; [[OLAP]]
: Online Analytical Processing, or OLAP, is an approach to quickly provide answers to analytical queries that are multi-dimensional in nature. OLAP is part of the broader category [[business intelligence]], which also encompasses relational reporting and data mining.
=== Artificial intelligence ===
; [[Artificial intelligence]]
: The implementation and study of systems that exhibit an autonomous intelligence or behaviour of their own.
; [[Artificial life]]
: The study of digital organisms to learn about biological systems and evolution.
; [[Automated reasoning]]
: Solving engines, such as used in [[Prolog]], which produce steps to a result given a query on a fact and rule database.
; [[Computer vision]]
: Algorithms for identifying three dimensional objects from one or more two dimensional pictures.
; [[Machine learning]]
: Automated creation of a set of rules and axioms based on input.
; [[Natural language processing]]/[[Computational linguistics]]
: Automated understanding and generation of human language
; [[Robotics]]
: Algorithms for controlling the behavior of robots.
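The notion of a solving engine working over a fact and rule database, as described under automated reasoning, can be sketched with forward chaining. This is a different strategy from the backward chaining that Prolog uses, chosen here because it is short to write; the facts and rules are invented for the example.

```python
def forward_chain(facts, rules):
    """Apply rules repeatedly until no new facts can be derived."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and set(premises) <= known:
                known.add(conclusion)
                changed = True
    return known

# Each rule is (premises, conclusion).
rules = [
    (("bird",), "has_wings"),
    (("has_wings", "healthy"), "can_fly"),
]

derived = forward_chain({"bird", "healthy"}, rules)
print("can_fly" in derived)
```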
=== Visual rendering (or Computer graphics) ===
; [[Computer graphics]]
: Algorithms both for generating visual images synthetically, and for integrating or altering visual and spatial information sampled from the real world.
; [[Image processing]]
: Determining information from an image through computation.
=== Human-Computer Interaction ===
; [[Human computer interaction]]
: The study of making computers and computations useful, usable and universally accessible to [[user (computing)|people]], including the study and design of computer interfaces through which people use computers.
=== Scientific computing ===
; [[Bioinformatics]]
: The use of computer science to maintain, analyse, and store [[biological data]], and to assist in solving biological problems such as [[protein folding]], function prediction and [[phylogeny]].
; [[Cognitive Science]]
: Computational modelling of real minds.
; [[Computational chemistry]]
: Computational modelling of theoretical chemistry in order to determine chemical structures and properties.
; [[Computational neuroscience]]
: Computational modelling of real brains.
; [[Computational physics]]
: Numerical simulations of large non-analytic systems.
; [[Numerical analysis|Numerical algorithms]]
: Algorithms for the numerical solution of mathematical problems such as [[Root-finding algorithm|root-finding]], [[Numerical integration|integration]], the [[Numerical ordinary differential equations|solution of ordinary differential equations]] and the approximation/evaluation of [[special functions]].
; [[Symbolic mathematics]]
: Manipulation and solution of expressions in symbolic form, also known as [[Computer algebra]].
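Root-finding, listed above among numerical algorithms, can be illustrated with the bisection method. The sketch assumes a continuous function whose values at the two end points have opposite signs; the tolerance and the example function are arbitrary choices.

```python
def bisect_root(f, low, high, tolerance=1e-10):
    """Approximate a root of f in [low, high] by repeated halving."""
    if f(low) * f(high) > 0:
        raise ValueError("f must change sign on the interval")
    while high - low > tolerance:
        mid = (low + high) / 2
        if f(low) * f(mid) <= 0:
            high = mid      # the sign change lies in the left half
        else:
            low = mid       # the sign change lies in the right half
    return (low + high) / 2

# The positive root of x^2 - 2 is the square root of 2.
root = bisect_root(lambda x: x * x - 2, 0.0, 2.0)
print(root)
```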
=== Didactics of computer science/informatics ===
The subfield didactics of computer science focuses on cognitive approaches of developing competencies of computer science and specific strategies for analysis, design, implementation and evaluation of excellent lessons in computer science.
== Computer science education ==
Some universities teach computer science as a theoretical study of computation and algorithmic reasoning. These programs often feature the [[theory of computation]], [[analysis of algorithms]], [[formal methods]], [[Concurrency (computer science)|concurrency theory]], [[databases]], [[computer graphics]] and [[systems analysis]], among others. They typically also teach [[computer programming]], but treat it as a vessel for the support of other fields of computer science rather than a central focus of high-level study.
Other colleges and universities, as well as [[secondary school]]s and vocational programs that teach computer science, emphasize the practice of advanced [[computer programming]] rather than the theory of algorithms and computation in their computer science curricula. Such curricula tend to focus on those skills that are important to workers entering the software industry. The practical aspects of computer programming are often referred to as [[software engineering]]. However, there is a lot of [[Debates within software engineering|disagreement]] over what the term "software engineering" actually means, and whether it is the same thing as programming.
Corpus linguistics
'''Corpus linguistics''' is the [[study of language]] as expressed in [[sample]]s ''([[Text corpus|corpora]])'' or "real world" text. This method represents a [[digest]]ive approach to deriving a set of abstract rules by which a [[natural language]] is governed or else relates to another language. Originally compiled by hand, corpora are now largely derived by an automated process, the output of which is then corrected.
Computational methods had once been viewed as a [[holy grail]] of [[linguistics|linguistic]] research, which would ultimately manifest a [[ruleset]] for [[natural language processing]] and [[machine translation]] at a high level. Such has not been the case, and since the [[cognitive revolution]], cognitive linguistics has been largely critical of many claimed practical uses for corpora. However, as [[computation]] capacity and speed have increased, the use of corpora to study language and term relationships en masse has gained some respectability.
The corpus approach runs counter to [[Noam Chomsky]]'s view that real language is riddled with performance-related errors, thus requiring careful analysis of small speech samples obtained in a highly controlled laboratory setting.
Corpus linguistics does away with Chomsky's ''competence/performance'' split; adherents believe that reliable language analysis best occurs on field-collected samples, in natural contexts and with minimal experimental interference.
== History ==
A landmark in modern corpus linguistics was the publication by [[Henry Kucera]] and [[Nelson Francis]] of ''Computational Analysis of Present-Day American English'' in 1967, a work based on the analysis of the [[Brown Corpus]], a carefully compiled selection of current American English, totalling about a million words drawn from a wide variety of sources. Kucera and Francis subjected it to a variety of computational analyses, from which they compiled a rich and variegated opus, combining elements of linguistics, language teaching, [[psychology]], [[statistics]], and [[sociology]]. A further key publication was [[Randolph Quirk]]'s 'Towards a description of English Usage' (1960, Transactions of the Philological Society, 40-61) in which he introduced ''The Survey of English Usage''.
Shortly thereafter, Boston publisher [[Houghton-Mifflin]] approached Kucera to supply a million word, three-line citation base for its new ''[[The American Heritage Dictionary of the English Language|American Heritage Dictionary]]'', the first [[dictionary]] to be compiled using corpus linguistics. The AHD made the innovative step of combining prescriptive elements (how language ''should'' be used) with descriptive information (how it actually ''is'' used).
Other publishers followed suit. The British publisher Collins' [[COBUILD]] [[monolingual learner's dictionary]], designed for users learning [[English language learning and teaching|English as a foreign language]], was compiled using the [[Bank of English]].
The [[Brown Corpus]] has also spawned a number of similarly structured corpora: the [[LOB Corpus]] (1960s [[British English]]), Kolhapur ([[Indian English]]), Wellington ([[New Zealand English]]), Australian Corpus of English ([[Australian English]]), the Frown Corpus ([[early 1990s]] [[American English]]), and the FLOB Corpus (1990s British English). Other corpora represent many languages, varieties and modes, and include the [[International Corpus of English]], and the [[British National Corpus]], a 100 million word collection of a range of spoken and written texts, created in the 1990s by a consortium of publishers, universities ([[Oxford University|Oxford]] and [[Lancaster University|Lancaster]]) and the [[British Library]]. For contemporary American English, work has stalled on the [[American National Corpus]], but the 360 million word [[Corpus of Contemporary American English (COCA)]] (1990-present) is now available.
== Methods ==
Corpus linguistics means dealing with real input data, where descriptions based on a linguist's intuition are not usually helpful.
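A basic computational analysis of a corpus is a word-frequency count. The sketch below applies one to a two-sentence sample invented for the example; a real corpus would be far larger, and the simple tokenizer shown is an assumption, not the method used for any of the corpora named above.

```python
import re
from collections import Counter

sample = "The cat sat on the mat. The dog sat too."

# Lower-case the text and split it into word tokens.
tokens = re.findall(r"[a-z']+", sample.lower())

# Count how often each word form occurs.
frequencies = Counter(tokens)
print(frequencies.most_common(2))
```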
Cross-platform
'''Cross-platform''' (also known as '''multi-platform''') is a term used in computing to refer to [[computer program]]s, [[operating system]]s, [[computer language]]s, [[programming language]]s, or other [[computer software]] and their implementations which can be made to work on multiple [[computer platform]]s. “Cross-platform” and “multi-platform” both refer to the idea that a given piece of computer software is able to be run on more than one computer platform. There are two major types of cross-platform software. One requires building for each platform that it supports (e.g., software written in a compiled language, such as [[Pascal (programming language)|Pascal]]). The other can be run directly on any platform which supports it: for example, software written in an [[interpreted language]] such as [[Perl]], [[Python (programming language)|Python]], or [[shell script]], or software written in a language which compiles to [[bytecode]] that is then redistributed (as is the case with [[Java (programming language)|Java]] and with languages used in the [[.NET Framework]], such as [[Chrome (programming language)|Chrome]]).
For example, a cross-platform [[application software|application]] may run on [[Microsoft Windows]] on the [[x86 architecture]], [[Linux]] on the [[x86 architecture]] and [[Mac OS X]] on either the [[PowerPC]] or [[x86]] based [[Apple Macintosh]] systems. A cross-platform [[application software|application]] may run on as many as all existing platforms, or on as few as two platforms.
== Platforms ==
A platform is a combination of hardware and software used to run software applications. A platform can be described simply as an operating system or computer architecture, or it could be the combination of both. Probably the most familiar platform is [[Microsoft Windows]] running on the [[x86 architecture]]. Other well-known desktop computer platforms include [[Linux]] and [[Mac OS X]] (both of which are themselves cross-platform). There are, however, many devices such as [[cellular telephones]] that are also effectively computer platforms but less commonly thought about in that way. [[Application software]] can be written to depend on the features of a particular platform—either the hardware, operating system, or virtual machine it runs on. The [[Java Platform|Java platform]] is a [[virtual machine]] platform which runs on many operating systems and hardware types, and is a common platform for software to be written for.
=== Hardware platforms ===
A '''hardware platform''' can refer to a computer’s [[computer architecture|architecture]] or [[processor architecture]]. For example, the [[x86]] and [[x86-64]] [[CPU]]s make up one of the most common [[computer architecture]]s in use in home machines today. These machines commonly run [[Microsoft Windows]], though they can run other [[operating system]]s as well, including [[Linux]], [[OpenBSD]], [[NetBSD]], [[Mac OS X]] and [[FreeBSD]].
=== Software platforms ===
A software platform can be either an [[operating system]] or a programming environment, though more commonly it is a combination of both. A notable exception to this is [[Java (programming language)|Java]], which uses an [[operating system]] independent [[virtual machine]] for its [[compiled]] code, known in the world of Java as [[bytecode]]. Examples of software platforms include:
* [[MS-DOS]] ([[x86]]), [[DR-DOS]] ([[x86]]), [[FreeDOS]] ([[x86]]) etc.
* [[Microsoft Windows]] ([[x86]], [[x64]])
* [[Linux]] (x86, x64, [[PowerPC]], various other architectures)
* [[Mac OS X]] (PowerPC, x86)
* [[OS/2]], [[eComStation]]
* [[AmigaOS]] ([[m68k]]), [[AROS]] (x86, PowerPC, m68k), [[MorphOS]] (PowerPC)
* [[Java (programming language)|Java]]
==== Java platform ====
As previously noted, the [[Java platform]] is an exception to the general rule that an [[operating system]] is a software platform. The Java platform provides a [[virtual machine]], or “virtual CPU”, which runs all of the code that is written for the language. This enables the same [[executable]] [[binary file|binary]] to run on all systems which support the Java software, through the [[Java Virtual Machine]]. Java [[executable]]s do not run directly on the [[operating system]]; that is, neither [[Microsoft Windows|Windows]] nor [[Linux]] executes Java programs directly.
Because of this, however, Java is limited in that it does not directly support system-specific functionality. [[Java Native Interface|JNI]] can be used to access system-specific functions, but then the code is likely no longer portable. Java programs can run on at least the [[Microsoft Windows]], [[Mac OS X]], [[Linux]], and [[Solaris Operating System|Solaris]] operating systems, and so the language is limited to functionality that exists on all these systems. This includes things such as [[computer networking]] and [[Internet socket]]s, but not necessarily raw hardware [[input/output]].
== Cross-platform software ==
In order for software to be considered '''cross-platform''', it must be able to function on more than one [[computer architecture]] or [[operating system]]. This can be a time-consuming task given that different [[operating system]]s have different [[application programming interface]]s (APIs); for example, [[Linux]] uses a different API for [[application software]] than [[Microsoft Windows|Windows]] does.
Just because a particular [[operating system]] may run on different [[computer architecture]]s, that does not mean that the software written for that operating system will automatically work on all [[computer architecture|architecture]]s that the operating system supports. One example, as of August 2006, was [[OpenOffice.org]], which did not natively run on the [[AMD64]] or [[EM64T]] lines of processors implementing the [[x86-64]] [[64-bit]] standards for computers; this has since been changed, and the OpenOffice.org suite of software is “mostly” ported to these 64-bit systems [http://wiki.services.openoffice.org/wiki/Porting_to_x86-64_(AMD64,_EM64T)]. Similarly, just because a program is written in a popular programming language such as [[C (programming language)|C]] or [[C++]], it does not mean it will run on all [[operating systems]] that support that [[programming language]].
=== Web applications ===
[[Web application]]s are typically described as cross-platform because, ideally, they are accessible from any of various [[web browser]]s within different operating systems. Such applications generally employ a [[client-server]] system architecture, and vary widely in complexity and functionality. This wide variability significantly complicates the goal of cross-platform capability, which is routinely at odds with the goal of advanced functionality.
==== Basic applications ====
Basic web applications perform all or most processing from a [[Stateless server|stateless]] [[web server]], and pass the result to the client web browser. All user interaction with the application consists of simple exchanges of data requests and server responses. These types of applications were the norm in the early phases of [[World Wide Web]] application development. Such applications follow a simple [[Transaction processing|transaction]] model, identical to that of serving [[static web page]]s. Today, they are still relatively common, especially where cross-platform compatibility and simplicity are deemed more critical than advanced functionality.
==== Advanced applications ====
Prominent examples of advanced web applications include the Web interface to [[Gmail]], [[A9.com]], and the maps.live.com section of [[Live Search]]. Such advanced applications routinely depend on additional features found only in the more recent versions of popular web browsers. These dependencies include [[Ajax (programming)|Ajax]], [[JavaScript]], [[Dynamic HTML|“Dynamic” HTML]], [[SVG]], and other components of [[rich internet application]]s. Older versions of popular browsers tend to lack support for certain features.
==== Design strategies ====
Because of the competing interests of cross-platform compatibility and advanced functionality, numerous alternative web application design strategies have emerged.
Such strategies include:
=====Graceful degradation=====
Graceful degradation attempts to provide the same or similar functionality to all users and platforms, while diminishing that functionality to a ‘least common denominator’ for more limited client browsers. For example, a user attempting to use a limited-feature browser to access Gmail may notice that Gmail switches to “Basic Mode,” with reduced functionality. Some view this strategy as a lesser form of cross-platform capability.
=====Separation of functionality=====
Separation of functionality attempts to simply omit those subsets of functionality that are not available from within certain client browsers or operating systems, while still delivering a ‘complete’ application to the user (see also [[Separation of concerns]]).
=====Multiple codebase=====
Multiple codebase applications present different versions of an application depending on the specific client in use. This strategy is arguably the most complicated and expensive way to fulfill cross-platform capability, since even different versions of the same client browser (within the same operating system) can differ dramatically from each other. This is further complicated by the support for “plugins”, which may or may not be present for any given installation of a particular browser version.
=====Third party libraries=====
Third party libraries attempt to simplify cross-platform capability by ‘hiding’ the complexities of client differentiation behind a single, unified API.
==== Testing strategies ====
One complicated aspect of cross-platform web application design is the need for [[software testing]]. In addition to the complications mentioned previously, there is the additional restriction that some browsers prohibit installation of different versions of the same browser on the same operating system. Techniques such as [[full virtualization]] are sometimes used as a workaround for this problem.
=== Traditional applications ===
Although web applications are becoming increasingly popular, many computer users still use traditional [[application software]] which does not rely on a client/web-server architecture.
The distinction between “traditional” and “web” applications is not always unambiguous, however, because applications have many different features, installation methods and architectures; and some of these can overlap and occur in ways that blur the distinction. Nevertheless, this simplifying distinction is a common and useful generalization.
==== Binary software ====
Traditionally in modern computing, application software has been distributed to end-users as '''binary images''', which are stored in [[executable]]s, a specific type of [[binary file]]. Such [[executable]]s only support the [[operating system]] and [[computer architecture]] that they were built for—which means that making a “cross-platform executable” would be something of a massive task, and is generally not done.
For software that is distributed as a [[binary file|binary]] [[executable]], such as software written in [[C (programming language)|C]] or [[C++]], the programmer must [[software build|build the software]] for each different [[operating system]] and [[computer architecture]]. For example, [[Mozilla]] [[Mozilla Firefox|Firefox]], an open-source web browser, is available on [[Microsoft Windows]], [[Mac OS X]] (both [[PowerPC]] and [[x86]] through something Apple calls a '''[[Universal binary]]'''), and [[Linux]] on multiple computer architectures. The three platforms (in this case, [[Microsoft Windows|Windows]], [[Mac OS X]], and [[Linux]]) are separate [[executable]] distributions, although they come from the same [[source code]].
In the context of binary software, cross-platform programs are written as source code and then “translated” to each system that they run on by compiling that source on the different platforms. Also, software can be [[porting|ported]] to a new [[computer architecture]] or [[operating system]] so that the program becomes more cross-platform than it already is. For example, a program such as Firefox, which already runs on Windows on the x86 family, can be modified and re-built to run on Linux on the x86 (and potentially other architectures) as well.
As an alternative to porting, cross-platform virtualization allows applications compiled for one CPU and operating system to run on a system with a different CPU and/or operating system, without modification to the source code or binaries. As an example, [[Apple Computer|Apple's]] [[Rosetta (software)|Rosetta]] software, which is built into [[Intel]]-based Apple Macintosh computers, runs applications compiled for the previous generation of Macs that used [[PowerPC]] CPUs. Another example is IBM PowerVM Lx86, which allows Linux/x86 applications to run unmodified on the Linux/Power operating system.
==== Scripts and [[interpreted language]]s ====
A script can be considered to be cross-platform if the [[scripting language]] is available on multiple platforms and the script only uses the facilities provided by the language. That is, a script written in [[Python (programming language)|Python]] for a [[Unix-like]] system will likely run with little or no modification on [[Microsoft Windows|Windows]], because Python also runs on [[Microsoft Windows|Windows]]; there is also more than one implementation of Python that will run the same scripts (e.g., [[IronPython]] for [[.NET Framework|.NET]]). The same goes for many of the [[open source]] [[programming language]]s that are available and are [[scripting language]]s.
Unlike [[binary file|binary]] [[executable]]s, the same script can be used on all computers that have software to interpret the script. This is because the script is generally stored in [[plain text]] in a [[text file]]. There may be some issues, however, such as the type of [[newline|new line character]] that separates the lines. Generally, however, little or no work has to be done to make a script written for one system run on another.
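The new line issue mentioned above can be sketched in a few lines of Python. This is only an illustration under the assumption that the text uses one of the three common conventions (LF, CRLF, or a bare CR); the function name <code>normalize_newlines</code> is hypothetical.

```python
def normalize_newlines(text: str) -> str:
    """Convert CRLF and bare CR line endings to LF.

    normalize_newlines is a hypothetical helper used only for illustration.
    CRLF is replaced first so that it does not turn into two LF characters.
    """
    return text.replace("\r\n", "\n").replace("\r", "\n")


if __name__ == "__main__":
    mixed = "first\r\nsecond\rthird\n"
    print(repr(normalize_newlines(mixed)))
```

A script that normalizes its input this way behaves the same regardless of which system produced the text file.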
Some quite popular cross-platform scripting or [[interpreted language]]s are:
* [[bash]]—A [[Unix shell]] commonly run on [[Linux]] and other modern [[Unix-like]] systems, as well as on [[Microsoft Windows|Windows]] via the [[Cygwin]] [[POSIX]] compatibility layer.
* [[Python (programming language)|Python]]—A modern [[scripting language]] where the focus is on [[rapid application development]] and ease-of-writing, instead of program run-time efficiency.
* [[Perl]]—A scripting language first released in 1987. Used for [[Common Gateway Interface|CGI]] [[WWW]] programming, small [[system administration]] tasks, and more.
* [[PHP]]—A [[scripting language]] most popular in use on the [[WWW]] for [[web application]]s.
* [[Ruby (programming language)|Ruby]]—A scripting language whose purpose is to be object-oriented and easy to read. Can also be used on the web through [[Ruby on Rails]].
* [[Tcl]] - A dynamic programming language, suitable for a wide range of uses, including web and desktop applications, networking, administration, testing and many more.
==== Video games ====
Cross-platform is a term that can also apply to [[video game]]s. Such games are released on a range of [[video game console]]s and [[handheld game console]]s, which are specialized [[computer]]s dedicated to the task of playing games (and thus are platforms like any other computer). Examples of these games include:
* [[Miner 2049er]], the first major multiplatform game
* [[Phantasy Star Online]]
* [[Lara Croft Tomb Raider: Legend]]
* [[FIFA Series]]
* [[Shadow of Legend]]
… which are spread across a variety of platforms, such as the [[Nintendo GameCube]], [[PlayStation 2]], [[Xbox]], [[Personal computer|PC]], and [[mobile devices]].
In some cases, depending on the hardware of a particular system it may take longer than expected to create a video game across multiple platforms. So, a video game may only get released on a few platforms and then later released on the remaining platforms. Typically, this is what occurs when a new system is released, because the [[Video game developer|developer]]s of the video game need to become acquainted with the hardware and software associated with the new console.
Some games may not become cross-platform because of licensing agreements between the [[Video game developer|developer]]s and the maker of the [[video game console]] which state that the game will only be made for one particular console. As an example, [[Disney]] could create a new game and wish to release it on the latest [[Nintendo]] and [[Sony]] game consoles. If [[Disney]] licenses the game with [[Sony]] first, [[Disney]] may be required to release the game only on [[Sony|Sony’s]] console for a short time, or indefinitely, which effectively prevents the game from being cross-platform for at least a period of time.
Several developers have developed ways to play games online across different platforms. Epic Games, Microsoft and Valve Software all have technology that allows Xbox 360 gamers and PS3 gamers to play with PC gamers, which lets gamers finally decide which platform is the best for a game. The first game released to allow this interactivity between PC and console players was [[Quake 3]].
Games that feature cross-platform online play include:
*[[Champions Online]]
*[[Lost Planet: Colonies]]
*[[Phantasy Star Online]]
*[[Shadowrun (2007 video game)|Shadowrun]]
*[[UNO (Xbox Live Arcade)|UNO]]
*[[Final Fantasy XI Online]]
== Platform independent software ==
Software that is platform independent does not rely on any special features of any single platform, or, if it does, handles those special features in a way that lets it work on multiple platforms. All [[algorithm]]s, such as the [[quicksort]] algorithm, can be implemented on different platforms.
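As an illustration of why an algorithm is platform independent, the following is a minimal sketch of quicksort in Python. It is a simple, non-in-place variant chosen for readability rather than efficiency, and it relies on nothing beyond the language itself, so it should behave identically wherever a Python interpreter is available.

```python
def quicksort(items):
    """Return a new sorted list using a simple (not in-place) quicksort."""
    if len(items) <= 1:
        return list(items)
    pivot = items[len(items) // 2]
    # Partition into three groups so that repeated values are handled correctly.
    smaller = [x for x in items if x < pivot]
    equal = [x for x in items if x == pivot]
    larger = [x for x in items if x > pivot]
    return quicksort(smaller) + equal + quicksort(larger)


if __name__ == "__main__":
    print(quicksort([3, 1, 2, 3, 0]))
```

Nothing in the function refers to an operating system, a file system, or a processor, which is what makes it portable without modification.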
== Cross-platform programming ==
Cross-platform programming is the practice of actively writing software that will work on more than one platform.
=== Approaches to cross-platform programming ===
There are different ways of approaching the problem of writing a cross-platform application program. One such approach is simply to create multiple versions of the same program in different ''source trees''; in other words, the [[Microsoft Windows|Windows]] version of a program might have one set of source code files and the [[Apple Macintosh|Macintosh]] version might have another, while a FOSS *nix system might have yet another. While this is a straightforward approach to the problem, it has the potential to be considerably more expensive in development cost, development time, or both, especially for corporate entities. The idea behind this is to create two or more different programs that behave similarly to each other. It is also possible that this means of developing a cross-platform application will result in more problems with bug tracking and fixing, because the different ''source trees'' would have different programmers, and thus different defects in each version. The smaller the programming team, the quicker the bug fixes tend to be.
Another approach that is used is to depend on pre-existing software that hides the differences between the platforms—called [[abstraction]] of the platform—such that the program itself is unaware of the platform it is running on. It could be said that such programs are ''platform agnostic''. Programs that run on the [[Java (Sun)|Java]] [[Virtual Machine]] ([[Java Virtual Machine|JVM]]) are built in this fashion.
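The abstraction approach can be sketched as a small Python function that hides a platform difference behind one interface. The function name <code>user_config_dir</code> is hypothetical, and the directory conventions it encodes (the <code>APPDATA</code> variable on Windows, <code>Library/Application Support</code> on Mac OS X, <code>XDG_CONFIG_HOME</code> elsewhere) are assumptions made for this illustration, not a complete treatment of any platform.

```python
import os
import sys
from pathlib import Path


def user_config_dir(app_name: str, platform_name: str = sys.platform) -> Path:
    """Return a per-user configuration directory for app_name.

    Hypothetical helper for illustration: callers ask for "the config
    directory" and never need to know which platform they are running on.
    """
    home = Path.home()
    if platform_name.startswith("win"):
        base = Path(os.environ.get("APPDATA", home / "AppData" / "Roaming"))
    elif platform_name == "darwin":
        base = home / "Library" / "Application Support"
    else:
        # Assumed default for Linux and other Unix-like systems.
        base = Path(os.environ.get("XDG_CONFIG_HOME", home / ".config"))
    return base / app_name


if __name__ == "__main__":
    print(user_config_dir("example-app"))
```

The rest of a program would call only <code>user_config_dir</code>, so the platform-specific branches stay in one place; toolkits that abstract a platform follow the same idea on a much larger scale.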
Some applications mix various methods of cross-platform programming to create the final application. An example of this is the [[Firefox]] [[web browser]], which uses [[abstraction]] to build some of the lower-level components, separate source subtrees for implementing platform specific features (like the GUI), and the implementation of more than one [[scripting language]] to help facilitate ease of portability. [[Firefox]] implements [[XUL]], [[Cascading Style Sheets|CSS]] and [[JavaScript]] for extending the browser, in addition to classic [[Netscape]]-style browser plugins. Much of the browser itself is written in XUL, CSS, and JavaScript, as well.
=== Cross-platform programming toolkits ===
There are a number of tools which are available to help facilitate the process of cross-platform programming:
* [[Simple DirectMedia Layer]]—An [[open source]] cross-platform multimedia library written in C that creates an abstraction over various platforms’ graphics, sound, and input [[Application programming interface|API]]s. It runs on many operating systems including Linux, Windows and Mac OS X and is aimed at games and multimedia applications.
* [[Cairo (graphics)|Cairo]]−A [[free software]] library used to provide a vector graphics-based, device-independent API. It is designed to provide primitives for 2-dimensional drawing across a number of different backends. Cairo is written in C and has bindings for many programming languages.
* ''ParaGUI''—ParaGUI is a cross-platform high-level application framework and GUI library. It can be compiled on various platforms (Linux, Win32, BeOS, Mac OS, ...). ParaGUI is based on the Simple DirectMedia Layer (SDL). ParaGUI is targeted at cross-platform multimedia applications and embedded devices operating on framebuffer displays.
* [[wxWidgets]]—An open source widget toolkit that is also an [[application framework]]. It runs on [[Unix-like]] systems with [[X11]], Microsoft Windows and Mac OS X. It permits applications written to use it to run on all of the systems that it supports, if the application does not use any [[operating system]]-specific programming in addition to it.
* [[Qt (toolkit)|Qt]]—An application framework and [[widget toolkit]] for [[Unix-like]] systems with [[X11]], Microsoft Windows, Mac OS X, and other systems—available under both [[open source]] and commercial licenses.
* [[GTK+]]—An open source widget toolkit for Unix-like systems with X11 and Microsoft Windows.
* [[FLTK]]—Another open source cross-platform toolkit, but more lightweight because it restricts itself to the GUI.
* [[Mozilla application framework|Mozilla]]—An open source platform for building Mac, Windows and Linux applications.
* [[Mono (software)|Mono]] (and more specifically, [[Microsoft .NET]])—A cross-platform framework for applications and programming languages.
* ''molib''—A commercial application toolkit library that abstracts system calls (such as the file system, database system and thread implementation) through C++ objects. This allows for the creation of applications that compile and run under Microsoft Windows, Mac OS X, GNU/Linux, and other systems (Sun OS, AIX, HP-UX, 32/64 bit, SMP). It is used in concert with ''the sandbox'' to create GUI-based applications.
* [[fpGUI]] - An open source widget toolkit that is completely implemented in Object Pascal. It currently supports Linux, Windows and a bit of Windows CE. fpGUI does not rely on any large libraries, instead it talks directly to Xlib (Linux) or GDI (Windows). The framework is compiled with the Free Pascal compiler. Mac OS support is also in the works.
* [[Tcl/Tk]] - Tcl (Tool Command Language) is an open source dynamic programming language used for web and desktop applications, networking, administration, testing and more; it is cross-platform, easily deployed and extensible. Tk is a graphical user interface toolkit that is the standard GUI not only for Tcl but for many other dynamic languages, and can produce native applications that run unchanged across Windows, Mac OS X, Linux and more. The combination of Tcl and the Tk GUI toolkit is referred to as Tcl/Tk.
* [[XVT]] is a cross-platform toolkit for creating enterprise and desktop applications in C/C++ on Windows, Linux and Unix (Solaris, HPUX, AIX), and Mac. The most recent release is 5.8, in April 2007.
=== Cross-platform development environments ===
Cross-platform applications can also be built using proprietary [[Integrated development environment|IDE]]s, or so-called [[Rapid Application Development]] tools. There are a number of development environments which allow developers to build and deploy applications across multiple platforms:
* [[Eclipse (software)|Eclipse]]—An open source [[software framework]] and [[Integrated development environment|IDE]] extendable through plug-ins including the C++ Development Toolkit. Eclipse is available on any operating system with a modern Java virtual machine (including Windows, Linux, and Mac OS X, Sun, HP-UX, and other systems).
* [[IntelliJ IDEA]]—A proprietary [[Integrated development environment|IDE]].
* [[NetBeans]]—An open source [[software framework]] and [[Integrated development environment|IDE]] extendable through plug-ins. NetBeans is available on any operating system with a modern Java virtual machine (including Windows, Linux, and Mac OS X, Sun, HP-UX, and other systems). It is similar to Eclipse in features and functionality, and is promoted by [[Sun Microsystems]].
* [[Omnis Studio]]—A proprietary [[Integrated development environment|IDE]] or Rapid Application Development tool for creating enterprise and web applications for Windows, Linux, and Mac OS X.
* [[Runtime Revolution]]—a proprietary [[Integrated development environment|IDE]], compiler engine and CGI builder that [[cross compile]]s to [[Microsoft Windows|Windows]], [[Mac OS X]] ([[PowerPC|PPC]], [[Intel]]), [[Linux]], [[Solaris Operating System|Solaris]], [[BSD]], and [[Irix]].
*[[Code::Blocks]]—A free/open source, cross platform IDE. It is developed in C++ using wxWidgets. Using a plugin architecture, its capabilities and features are defined by the provided plugins.
*[[Lazarus (software)]]—Lazarus is a cross platform Visual IDE developed for and supported by the open source Free Pascal compiler. It aims to provide a Rapid Application Development Delphi Clone for Pascal and Object Pascal developers.
*[[REALbasic]]—REALbasic (RB) is an object-oriented dialect of the BASIC programming language developed and commercially marketed by REAL Software, Inc in Austin, Texas for Mac OS X, Microsoft Windows, and Linux.
== Criticisms of cross-platform development ==
There are certain issues associated with cross-platform development. Some of these include:
* Testing cross-platform applications may be considerably more complicated, since different platforms can exhibit slightly different behaviors or subtle bugs. This problem has led some developers to deride cross-platform development as “Write Once, Debug Everywhere”, a take on Sun’s [[Write once, run anywhere|“Write Once, Run Anywhere”]] marketing slogan.
* Developers are often restricted to using the [[lowest common denominator]] subset of features which are available on all platforms. This may hinder the application's performance or prohibit developers from using platforms’ most advanced features.
* Different platforms often have different user interface conventions, which cross-platform applications do not always accommodate. For example, applications developed for Mac OS X and [[GNOME]] are supposed to place the most important button on the right-hand side of windows and dialogs, whereas Microsoft Windows and [[KDE]] have the opposite convention. Though many of these differences are subtle, a cross-platform application which does not conform appropriately to these conventions may feel clunky or alien to the user. When working quickly, such opposing conventions may even result in [[data loss]], such as in a [[dialog box]] confirming whether the user wants to save or discard changes to a file.
* Programs written in scripting languages or compiled for virtual machines must be translated into native executable code each time the application is executed, imposing a performance penalty. This performance hit can be alleviated using advanced techniques like [[just-in-time compilation]]; but even using such techniques, some performance overhead may be unavoidable.
Data
'''Data''' (singular: '''datum''') are collections of descriptors of natural phenomena, including the results of [[experience]], [[observation]] or [[experiment]], or a set of [[premise]]s. Data may consist of [[number]]s, [[word]]s, or [[image]]s, particularly as [[measurement]]s or observations of a set of [[variable]]s.
==Etymology==
The word ''data'' is the plural of [[Latin]] ''[[datum]]'', [[Grammatical gender|neuter]] past [[participle]] of ''dare'', "to give", hence "something given". The [[past participle]] of "to give" has been used for millennia, in the sense of a statement accepted at face value; one of the works of [[Euclid]], circa 300 BC, was the ''Dedomena'' (in Latin, ''Data''). In discussions of problems in [[geometry]], [[mathematics]], [[engineering]], and so on, the terms ''givens'' and ''data'' are used interchangeably. Such usage is the origin of ''data'' as a concept in [[computer science]]: ''data'' are numbers, words, images, etc., accepted as they stand. The word is pronounced dey-tuh, dat-uh, or dah-tuh.
[[Experimental data]] are data generated within the context of a scientific investigation. Mathematically, data can be grouped in many ways.
==Usage in English==
In [[English language|English]], the word ''datum'' is still used in the general sense of "something given", and more specifically in [[cartography]], [[geography]], [[geology]], [[NMR]] and [[technical drawing|drafting]] to mean a reference point, reference line, or reference surface. More generally speaking, any measurement or result can be called a (single) ''datum'', but ''data point'' is more common. Both ''datums'' (see usage in [[datum]] article) and the originally Latin plural ''data'' are used as the plural of ''datum'' in English, but ''data'' is more commonly treated as a [[mass noun]] and used in the [[Grammatical number|singular]], especially in day-to-day usage. For example, "This is all the data from the experiment". This usage is inconsistent with the rules of Latin grammar and traditional English, which would instead suggest "These are all the data from the experiment". Some British and UN academic, scientific, and professional [[style guides]] (e.g., see page 43 of the [http://whqlibdoc.who.int/hq/2004/WHO_IMD_PUB_04.1.pdf World Health Organization Style Guide]) request that authors treat ''data'' as a plural noun. Other international organizations, such as the IEEE Computer Society, allow its usage as either a mass noun or plural based on author preference. It is now usually treated as a singular mass noun in informal usage, but usage in scientific publications shows a strong UK/U.S. divide. U.S. usage tends to treat ''data'' in the singular, including in serious and academic publishing, although some major newspapers (such as the [[New York Times]]) regularly use it in the plural.
"The plural usage is still common, as this headline from the New York Times attests: “Data Are Elusive on the Homeless.” Sometimes scientists think of data as plural, as in ''These data do not support the conclusions.'' But more often scientists and researchers think of data as a singular mass entity like information, and most people now follow this in general usage."[http://www.bartleby.com/61/51/D0035100.html] UK usage now widely accepts treating ''data'' as singular in standard English, including everyday newspaper usage at least in non-scientific use. UK scientific publishing usually still prefers treating it as a plural.. Some UK university style guides recommend using ''data'' for both singular and plural use and some recommend treating it only as a singular in connection with computers.
==Uses of ''data'' in science and computing==
''Raw data'' are [[number]]s, [[character (computing)|characters]], [[image]]s or other outputs from devices that convert physical quantities into symbols, in a very broad sense. Such data are typically further [[data processing|processed]] by a human or [[input]] into a [[computer]], [[Computer storage|stored]] and processed there, or transmitted ([[output]]) to another human or computer. ''Raw data'' is a relative term; data processing commonly occurs by stages, and the "processed data" from one stage may be considered the "raw data" of the next.
Mechanical computing devices are classified according to the means by which they represent data. An [[analog computer]] represents a datum as a voltage, distance, position, or other physical quantity. A [[digital computer]] represents a datum as a sequence of symbols drawn from a fixed [[alphabet]]. The most common digital computers use a binary alphabet, that is, an alphabet of two characters, typically denoted "0" and "1". More familiar representations, such as numbers or letters, are then constructed from the binary alphabet.
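How familiar representations are built from a binary alphabet can be shown with a short Python sketch. It assumes the UTF-8 encoding, which is one common choice among many, and the helper names <code>to_bits</code> and <code>from_bits</code> are hypothetical.

```python
def to_bits(text: str) -> str:
    """Encode text as UTF-8 and render each byte as eight binary digits."""
    return " ".join(f"{byte:08b}" for byte in text.encode("utf-8"))


def from_bits(bits: str) -> str:
    """Reverse to_bits: parse space-separated 8-bit groups back into text."""
    return bytes(int(chunk, 2) for chunk in bits.split()).decode("utf-8")


if __name__ == "__main__":
    encoded = to_bits("A")
    print(encoded)             # the letter "A" as a sequence of 0s and 1s
    print(from_bits(encoded))  # round trip back to the original text
```

The round trip shows that the letters are recoverable from the sequence of "0" and "1" symbols, given agreement on the encoding.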
Some special forms of data are distinguished. A [[computer program]] is a collection of data, which can be interpreted as instructions. Most computer languages make a distinction between programs and the other data on which programs operate, but in some languages, notably [[Lisp programming language|Lisp]] and similar languages, programs are essentially indistinguishable from other data. It is also useful to distinguish [[metadata]], that is, a description of other data. A similar yet earlier term for metadata is "ancillary data." The prototypical example of metadata is the library catalog, which is a description of the contents of books.
==Meaning of data, information and knowledge==
The terms data, [[information]] and [[knowledge]] are frequently used for overlapping concepts. The main difference is in the level of [[abstraction]] being considered. Data is the lowest level of abstraction, information is the next level, and knowledge is the highest level of the three. For example, the height of Mt. Everest is generally considered as "data", a book on Mt. Everest's geological characteristics may be considered as "information", and a report containing practical information on the best way to reach Mt. Everest's peak may be considered as "knowledge".
Information as a concept bears a diversity of meanings, from everyday usage to technical settings. Generally speaking, the concept of information is closely related to notions of constraint, communication, control, data, form, instruction, knowledge, meaning, mental stimulus, pattern, perception, and representation.
Beynon-Davies uses the concept of a [[sign]] to distinguish between [[data]] and [[information]]. Data are symbols. Information occurs when symbols are used to refer to something.
Data analysis
'''Data analysis''' is the process of looking at and summarizing [[data]] with the intent to extract useful [[information]] and develop conclusions. Data analysis is closely related to [[data mining]], but data mining tends to focus on larger data sets, with less emphasis on making [[inference]], and often uses data that was originally collected for a different purpose. In [[statistics|statistical applications]], some people divide data analysis into [[descriptive statistics]], [[exploratory data analysis]] (EDA) and [[confirmatory data analysis]] (CDA), where EDA focuses on discovering new features in the data, and CDA on confirming or falsifying existing hypotheses.
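The descriptive statistics step can be illustrated with a minimal Python sketch using the standard-library <code>statistics</code> module. The function name <code>describe</code> and the sample values are hypothetical and serve only as an example; real analyses typically use many more summary measures.

```python
import statistics


def describe(values):
    """Summarize a list of numbers with a few common descriptive statistics.

    describe is a hypothetical helper used only for illustration.
    """
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }


if __name__ == "__main__":
    sample = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up measurements
    for name, value in describe(sample).items():
        print(f"{name}: {value}")
```

A summary like this describes the data at hand; exploratory and confirmatory analysis would go on to look for new features or test hypotheses.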
Data analysis assumes different aspects, and possibly different names, in different fields.
The term ''data analysis'' is also used as a synonym for [[data modeling]], which is unrelated to the subject of this article.
==Nuclear and particle physics==
In [[nuclear physics|nuclear]] and [[particle physics]] the data usually originate from the [[particle detector|experimental apparatus]] via a [[data acquisition]] system. They are then processed, in a step usually called ''data reduction'', to apply calibrations and to extract physically significant information. Data reduction is most often, especially in large particle physics experiments, an automatic, batch-mode operation carried out by software written ad hoc. The resulting data ''n-tuples'' are then scrutinized by the physicists, using specialized software tools like [[ROOT]] or [[Physics Analysis Workstation|PAW]], comparing the results of the experiment with theory.
The theoretical models are often difficult to compare directly with the results of the experiments, so they are used instead as input for [[Monte Carlo method|Monte Carlo simulation]] software like [[Geant4]] that predict the response of the detector to a given theoretical event, producing '''simulated events''' which are then compared to experimental data.
See also: [[Computational physics]].
==Social sciences==
[[Qualitative data analysis]] (QDA) or [[qualitative research]] is the analysis of non-numerical data, for example words, photographs and observations.
==Information technology==
A special case is the [[Data analysis (information technology in othm )|data analysis in information technology audits]].
==Business==
See
* [[Analytics]]
* [[Business intelligence]]
* [[Data mining]]
Database
A '''database''' is a [[structure]]d collection of records or [[data]]. A [[computer]] database relies upon [[software]] to organize the storage of data. The software models the database structure in what are known as [[database model]]s. The model in most common use today is the [[relational model]]. Other models such as the [[hierarchical model]] and the [[network model]] use a more explicit representation of relationships (see below for explanation of the various database models).
Database management systems (DBMS) are the software used to organize and maintain the database. These are categorized according to the [[database model]] that they support. The model tends to determine the query languages that are available to access the database. A great deal of the internal engineering of a DBMS, however, is independent of the data model, and is concerned with managing factors such as performance, concurrency, integrity, and recovery from [[hardware failure]]s. In these areas there are large differences between products.
==History==
The earliest known use of the term '''''data base''''' was in November 1963, when the [[System Development Corporation]] sponsored a symposium under the title ''Development and Management of a Computer-centered Data Base''. '''Database''' as a single word became common in Europe in the early 1970s and by the end of the decade it was being used in major American newspapers. (The abbreviation DB, however, survives.)
The first database management systems were developed in the 1960s. A pioneer in the field was [[Charles Bachman]]. Bachman's early papers show that his aim was to make more effective use of the new direct access storage devices becoming available: until then, data processing had been based on [[punch card|punched cards]] and [[magnetic tape]], so that serial processing was the dominant activity. Two key [[data model]]s arose at this time: [[CODASYL]] developed the [[network model]] based on Bachman's ideas, and (apparently independently) the [[hierarchical model]] was used in a system developed by [[North American Rockwell]] later adopted by [[IBM]] as the cornerstone of their [[Information Management System|IMS]] product. While IMS along with the CODASYL [[IDMS]] were the big, high visibility databases developed in the 1960s, several others were also born in that decade, some of which have a significant installed base today. Two worthy of mention are the [[Pick operating system|PICK]] and [[MUMPS]] databases, with the former developed originally as an operating system with an embedded database and the latter as a programming language and database for the development of healthcare systems.
The [[relational model]] was proposed by [[Edgar F. Codd|E. F. Codd]] in 1970. He criticized existing models for confusing the abstract description of information structure with descriptions of physical access mechanisms. For a long while, however, the relational model remained of academic interest only. While network model products (such as the CODASYL-based IDMS) and hierarchical model products (such as IMS) were conceived as practical engineering solutions taking account of the technology as it existed at the time, the relational model took a much more theoretical perspective, arguing (correctly) that hardware and software technology would catch up in time. Among the first implementations were [[Michael Stonebraker]]'s [[Ingres (database)|Ingres]] at [[University of California, Berkeley|Berkeley]], and the [[System R]] project at IBM. Both of these were research prototypes, announced during 1976. The first commercial products, [[Oracle database|Oracle]] and [[IBM DB2|DB2]], did not appear until around 1980. The first successful database product for microcomputers was [[dBASE]] for the [[CP/M]] and [[PC-DOS]]/[[MS-DOS]] operating systems.
During the 1980s, research activity focused on [[distributed database]] systems and [[database machine]]s. Another important theoretical idea was the [[Functional Data Model]], but apart from some specialized applications in genetics, molecular biology, and fraud investigation, the world took little notice.
In the 1990s, attention shifted to [[OODB|object-oriented databases]]. These had some success in fields where it was necessary to handle more complex data than relational systems could easily cope with, such as [[spatial database]]s, engineering data (including software [[Software repository|repositories]]), and multimedia data. Some of these ideas were adopted by the relational vendors, who integrated new features into their products as a result. The 1990s also saw the spread of [[Open Source]] databases, such as [[PostgreSQL]] and [[MySQL]].
In the 2000s, the fashionable area for innovation is the [[XML database]]. As with object databases, this has spawned a new collection of start-up companies, but at the same time the key ideas are being integrated into the established relational products. [[XML databases]] aim to remove the traditional divide between documents and data, allowing all of an organization's information resources to be held in one place, whether they are highly structured or not.
==Database models==
Various techniques are used to model data structure. Most database systems are built around one particular data model, although it is increasingly common for products to offer support for more than one model. For any one [[logical model]] various physical implementations may be possible, and most products will offer the user some level of control in tuning the [[physical implementation]], since the choices that are made have a significant effect on performance. Here are three examples:
===Hierarchical model===
In a [[hierarchical model]], data is organized into an inverted tree-like structure, implying a multiple downward link in each node to describe the nesting, and a sort field to keep the records in a particular order in each same-level list. This structure arranges the various data elements in a hierarchy and helps to establish logical relationships among data elements of multiple files. Each unit in the model is a record which is also known as a node. In such a model, each record on one level can be related to multiple records on the next lower level. A record that has subsidiary records is called a parent and the subsidiary records are called children. Data elements in this model are well suited for one-to-many relationships with other data elements in the database.
This model is advantageous when the data elements are inherently hierarchical. The disadvantage is that in order to prepare the database it becomes necessary to identify the requisite groups of files that are to be logically integrated. Hence, a hierarchical data model may not always be flexible enough to accommodate the dynamic needs of an organization.
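The parent/child structure described above can be sketched in a few lines of Python. This is only an illustrative sketch: the `Record` class, its `key` sort field and the sample company data are hypothetical names chosen for the example, not part of any particular hierarchical DBMS.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """One node of the hierarchy: a sort key, a payload and child records."""
    key: str
    data: dict = field(default_factory=dict)
    children: list = field(default_factory=list)

    def add_child(self, child):
        # Keep records on the same level ordered by the sort field (here: key).
        self.children.append(child)
        self.children.sort(key=lambda record: record.key)
        return child


def walk(record, depth=0):
    """Yield (depth, key) pairs top-down, each parent before its children."""
    yield depth, record.key
    for child in record.children:
        yield from walk(child, depth + 1)


# Hypothetical sample data: one parent, two children, two grandchildren.
company = Record("company")
sales = company.add_child(Record("sales"))
engineering = company.add_child(Record("engineering"))
sales.add_child(Record("bob"))
sales.add_child(Record("alice"))
```

Each record has exactly one parent, so reaching `alice` means descending from `company` through `sales`, which is the access pattern the hierarchical model assumes.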
===Network model===
The [[network model]] tends to store records with links to other records. Each record in the database can have multiple parents, i.e., the relationships among data elements can have a many to many relationship. Associations are tracked via "pointers". These pointers can be node numbers or disk addresses. Most network databases tend to also include some form of hierarchical model. Databases can be translated from hierarchical model to network and vice versa. The main difference between the network model and hierarchical model is that in a network model, a child can have a number of parents whereas in a hierarchical model, a child can have only one parent.
The network model has an advantage over the hierarchical model in that it promotes greater flexibility and data accessibility, since records at a lower level can be accessed without accessing the records above them. This model is more efficient than the hierarchical model, easier to understand, and can be applied to many real-world problems that require routine transactions. The disadvantages are that it is a complex process to design and develop a network database; it has to be refined frequently; it requires that the relationships among all the records be defined before development starts, and changes often demand major programming efforts; and operation and maintenance of the network model are expensive and time consuming.
Examples of database engines that have network model capabilities are [[RDM Embedded]] and [[RDM Server]].
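As a rough illustration of records linked by pointers, the following Python sketch stores records under node numbers and keeps the links as lists of node numbers. The supplier and part records are invented for the example, and real network databases use their own storage formats.

```python
# Records are stored by node number; links hold node numbers ("pointers").
records = {
    1: {"name": "Supplier A"},
    2: {"name": "Supplier B"},
    3: {"name": "Bolt"},
    4: {"name": "Nut"},
}

# owner node -> member nodes. Node 3 appears under two owners,
# which a strict hierarchy (one parent per child) could not express.
links = {1: [3, 4], 2: [3]}


def children_of(node):
    """Follow the pointers from an owner record to its member records."""
    return links.get(node, [])


def parents_of(node):
    """Find every owner record that points at the given record."""
    return sorted(owner for owner, members in links.items() if node in members)
```

Here `parents_of(3)` returns both suppliers, showing the many-to-many relationship that distinguishes the network model from the hierarchical one.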
===Relational model===
The basic data structure of the relational model is a table where information about a particular entity (say, an employee) is represented in columns and rows. The columns enumerate the various attributes of an entity (e.g. employee_name, address, phone_number). Rows (also called records) represent instances of an entity (e.g. specific employees).
The "relation" in "relational database" comes from the mathematical notion of [[Relation (mathematics)|relations]] from the field of [[set theory]]. A relation is a set of [[tuple]]s, so rows are sometimes called tuples. All tables in a relational database adhere to three basic rules.
* The ordering of columns is immaterial
* Identical rows are not allowed in a table
* Each row has a single (separate) value for each of its columns (each tuple has an atomic value).
If the same value occurs in two different records (from the same table or different tables) it can imply a relationship between those records. Relationships between records are often categorized by their [[Cardinality (data modeling)|cardinality]] (1:1, (0), 1:M, M:M).
Tables can have a designated column or set of columns that act as a "key" to select rows from that table with the same or similar key values. A "primary key" is a key that has a unique value for each row in the table. Keys are commonly used to join or combine data from two or more tables. For example, an ''employee'' table may contain a column named ''address'' which contains a value that matches the key of an ''address'' table. Keys are also critical in the creation of indexes, which facilitate fast retrieval of data from large tables. It is not necessary to define all the keys in advance; a column can be used as a key even if it was not originally intended to be one.
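The rules above and the role of keys can be illustrated with plain Python sets of tuples. This is a sketch of the idea only, with made-up employee and address data; it is not how an RDBMS stores tables.

```python
# A relation modelled as a set of tuples: duplicates collapse, order is irrelevant.
employees = {
    ("Alice", 10),
    ("Bob", 20),
    ("Alice", 10),  # identical row: the set keeps only one copy
}

addresses = {
    (10, "1 Main St"),
    (20, "2 Oak Ave"),
}

# Look up addresses by their primary key (the first column).
address_by_key = {key: street for key, street in addresses}

# Join: match each employee's address column against the address key.
joined = sorted((name, address_by_key[addr]) for name, addr in employees)
```

Because `employees` is a set, the repeated `("Alice", 10)` row is stored once, mirroring the rule that identical rows are not allowed.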
====Relational operations====
Users (or programs) request data from a relational database by sending it a [[query]] that is written in a special language, usually a dialect of [[SQL]]. Although SQL was originally intended for end-users, it is much more common for SQL queries to be embedded into software that provides an easier user interface. Many web applications, such as [[Wikipedia]], perform SQL queries when generating pages.
In response to a query, the database returns a result set, which is the list of rows constituting the answer. The simplest query is just to return all the rows from a table, but more often, the rows are filtered in some way to return just the answer wanted. Often, data from multiple tables are combined into one, by doing a [[Join (SQL)|join]]. There are a number of relational operations in addition to join.
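The three kinds of query mentioned above (all rows, filtered rows and a join) can be tried with Python's standard `sqlite3` module. The table layout and sample rows below are assumptions made for the example.

```python
import sqlite3

# In-memory database with two small, hypothetical tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE address (id INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE employee (
        id INTEGER PRIMARY KEY,
        employee_name TEXT,
        address_id INTEGER REFERENCES address(id)
    );
    INSERT INTO address VALUES (1, 'Oslo');
    INSERT INTO address VALUES (2, 'Lima');
    INSERT INTO employee VALUES (1, 'Alice', 1);
    INSERT INTO employee VALUES (2, 'Bob', 2);
    INSERT INTO employee VALUES (3, 'Carol', 1);
""")

# Simplest query: every row of one table.
all_rows = conn.execute(
    "SELECT employee_name FROM employee ORDER BY id"
).fetchall()

# Filtered query: only the rows that match a condition.
filtered = conn.execute(
    "SELECT employee_name FROM employee WHERE address_id = ? ORDER BY id", (1,)
).fetchall()

# Join: combine rows from both tables through the key column.
joined = conn.execute("""
    SELECT e.employee_name, a.city
    FROM employee AS e JOIN address AS a ON e.address_id = a.id
    ORDER BY e.id
""").fetchall()
```

Each call returns a result set as a list of row tuples.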
====Normal forms====
Relations are classified based upon the types of anomalies to which they're vulnerable. A database that's in the first normal form is vulnerable to all types of anomalies, while a database that's in the domain/key normal form has no modification anomalies. Normal forms are hierarchical in nature. That is, the lowest level is the first normal form, and the database cannot meet the requirements for higher level normal forms without first having met all the requirements of the lesser normal form.
==Database Management Systems==
===Relational database management systems===
An RDBMS implements the features of the relational model outlined above. In this context, [[Christopher J. Date|Date]]'s '''Information Principle''' states:
The entire information content of the database is represented in one and only one way, namely as explicit values in column positions (attributes) and rows in relations ([[tuple]]s). Therefore, there are no explicit pointers between related tables.
===Post-relational database models===
Several products have been identified as [[post-relational]] because the data model incorporates [[relations]] but is not constrained by the Information Principle, requiring that all information is represented by [[data values]] in relations. Products using a post-relational data model typically employ a model that actually pre-dates the [[relational model]]. These might be identified as a [[directed graph]] with [[tree data structure|trees]] on the [[data structure|nodes]].
Examples of models that could be classified as post-relational are [[Pick operating system|PICK]] aka [[Multidimensional database|MultiValue]], and [[MUMPS]].
===Object database models===
In recent years, the [[object-oriented]] paradigm has been applied to database technology, creating a new programming model known as [[object database]]s. These databases attempt to bring the database world and the application programming world closer together, in particular by ensuring that the database uses the same [[type system]] as the application program. This aims to avoid the overhead (sometimes referred to as the ''[[Object-Relational impedance mismatch|impedance mismatch]]'') of converting information between its representation in the database (for example as rows in tables) and its representation in the application program (typically as objects). At the same time, object databases attempt to introduce the key ideas of object programming, such as [[encapsulation]] and [[polymorphism (computer science)|polymorphism]], into the world of databases.
A variety of ways have been tried for storing objects in a database. Some products have approached the problem from the application programming end, by making the objects manipulated by the program [[Persistence (computer science)|persistent]]. This also typically requires the addition of some kind of query language, since conventional programming languages do not have the ability to find objects based on their information content. Others have attacked the problem from the database end, by defining an object-oriented data model for the database, and defining a database programming language that allows full programming capabilities as well as traditional query facilities.
==DBMS internals==
===Storage and physical database design===
Database tables/indexes are typically stored in memory or on hard disk in one of many forms, ordered/unordered [[flat file database|flat files]], [[ISAM]], [[heap (data structure)|heaps]], [[hash table|hash buckets]] or [[B+ tree]]s. These have various advantages and disadvantages discussed further in the main article on this topic. The most commonly used are B+ trees and ISAM.
Other important design choices relate to the clustering of data by category (such as grouping data by month, or location), creating pre-computed views known as materialized views, and partitioning data by range or hash. Memory management and storage topology can also be important design choices for database designers. Just as normalization is used to reduce storage requirements and improve the extensibility of the database, denormalization is often used to reduce join complexity and reduce execution time for queries.
====Indexing====
All of these databases can take advantage of [[Index (database)|indexing]] to increase their speed. This technology has advanced tremendously since its early uses in the 1960s and 1970s. The most common kind of index is a sorted list of the contents of some particular table column, with pointers to the row associated with the value. An index allows a set of table rows matching some criterion to be located quickly. Typically, indexes are also stored in the various forms of data-structure mentioned above (such as [[B-tree]]s, [[hash table|hash]]es, and [[linked lists]]). Usually, a specific technique is chosen by the database designer to increase efficiency in the particular case of the type of index required.
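The "sorted list with pointers" form of index described above can be sketched with Python's `bisect` module. The table contents are hypothetical, and production systems would typically use a B-tree or similar structure rather than a flat list.

```python
import bisect

# A tiny "table": the list position of each row plays the role of a row pointer.
rows = [
    {"id": 0, "name": "Dana"},
    {"id": 1, "name": "Adam"},
    {"id": 2, "name": "Cleo"},
    {"id": 3, "name": "Adam"},
]

# The index: (column value, row position) pairs kept in sorted order.
index = sorted((row["name"], pos) for pos, row in enumerate(rows))


def lookup(value):
    """Return the positions of all rows whose name equals value."""
    # Binary search for the first entry whose value is >= the one requested.
    i = bisect.bisect_left(index, (value,))
    positions = []
    while i < len(index) and index[i][0] == value:
        positions.append(index[i][1])
        i += 1
    return positions
```

The lookup needs a binary search plus a short scan over the matches, instead of reading every row of the table.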
Relational DBMSs have the advantage that indexes can be created or dropped without changing the existing applications that make use of them. The database chooses between many different strategies based on which one it estimates will run the fastest. In other words, indexes are transparent to the application or end user querying the database; while they affect performance, any [[SQL]] statement will compute the same result with or without an index. The RDBMS will produce a plan of how to execute the query, which is generated by estimating the run times of the different algorithms and selecting the quickest. Some of the key algorithms that deal with [[join (SQL)|joins]] are [[nested loop join]], [[sort-merge join]] and [[hash join]]. Which of these is chosen depends on whether an index exists, what type it is, and its [[Cardinality (SQL statements)|cardinality]].
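The transparency of indexes can be checked with a small experiment using Python's `sqlite3` module. The table and the index name `idx_employee_name` are invented for the example; only the results are compared, since the chosen query plan depends on the database engine and version.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO employee (name) VALUES (?)",
    [("Adam",), ("Cleo",), ("Adam",)],
)

query = "SELECT id FROM employee WHERE name = ? ORDER BY id"

# Same query text, run without an index, with one, and after dropping it.
before = conn.execute(query, ("Adam",)).fetchall()
conn.execute("CREATE INDEX idx_employee_name ON employee (name)")
with_index = conn.execute(query, ("Adam",)).fetchall()
conn.execute("DROP INDEX idx_employee_name")
after_drop = conn.execute(query, ("Adam",)).fetchall()
```

The application code never mentions the index, yet all three runs return the same rows; only the work done internally may differ.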
An index speeds up access to data, but it has disadvantages as well. First, every index increases the amount of storage on the hard drive necessary for the database file, and second, the index must be updated each time the data are altered, and this costs time.
(Thus an index saves time in the reading of data, but it costs time in entering and altering data. It thus depends on the use to which the data are to be put whether an index is on the whole a net plus or minus in the quest for efficiency.)
A special case of an index is a primary index, or primary key, which is distinguished in that the primary index must ensure a unique reference to a record. Often, for this purpose one simply uses a running index number (ID number). Primary indexes play a significant role in relational databases, and they can speed up access to data considerably.
===Transactions and concurrency===
In addition to their data model, most practical databases ("transactional databases") attempt to enforce a [[database transaction]]. Ideally, the database software should enforce the [[ACID]] rules, summarized here:
* [[Atomicity]]: Either all the tasks in a transaction must be done, or none of them. The transaction must be completed, or else it must be undone (rolled back).
* [[Database consistency|Consistency]]: Every transaction must preserve the integrity constraints — the declared consistency rules — of the database. It cannot place the data in a contradictory state.
* [[Isolation]]: Two simultaneous transactions cannot interfere with one another. Intermediate results within a transaction are not visible to other transactions.
* [[Durability (computer science)|Durability]]: Completed transactions cannot be aborted later or their results discarded. They must persist through (for instance) restarts of the DBMS after crashes.
In practice, many DBMSs allow most of these rules to be selectively relaxed for better performance. [[Concurrency control]] is a method used to ensure that transactions are executed in a safe manner and follow the ACID rules. The DBMS must be able to ensure that only [[serializability|serializable]], [[serializability#correctness - recoverability|recoverable]] schedules are allowed, and that no actions of committed transactions are lost while undoing aborted transactions.
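Atomicity and consistency can be demonstrated with a short `sqlite3` sketch. The `account` table, the `CHECK` constraint and the `transfer` helper are hypothetical; the example relies on the documented behaviour of a `sqlite3` connection used as a context manager, which commits on success and rolls back if an exception is raised.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"  # declared consistency rule
)
conn.executemany(
    "INSERT INTO account VALUES (?, ?)", [("alice", 100), ("bob", 50)]
)
conn.commit()


def transfer(amount):
    """Move money from alice to bob; either both updates happen or neither."""
    try:
        with conn:  # commits on success, rolls back on an exception
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = 'bob'",
                (amount,),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = 'alice'",
                (amount,),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances():
    return dict(conn.execute("SELECT name, balance FROM account"))
```

A transfer that would drive a balance negative violates the constraint on its second update, so the first update is undone as well and the data never reaches a contradictory state.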
===Replication===
Replication of databases is closely related to transactions. If a database can log its individual actions, it is possible to create a duplicate of the data in real time.
The duplicate can be used to improve performance or availability of the whole database system.
Common replication concepts include:
* Master/Slave Replication: All write requests are performed on the master and then replicated to the slaves
* Quorum: The result of Read and Write requests are calculated by querying a "majority" of replicas.
* Multimaster: Two or more replicas sync each other via a transaction identifier.
Parallel synchronous replication of databases enables transactions to be replicated on multiple servers simultaneously, which provides a method for backup and security as well as data availability.
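The idea that a logged sequence of actions is enough to build a duplicate can be sketched as follows. The `Primary` and `Replica` classes are hypothetical and deliberately minimal: they model master/slave replication of a key-value store and ignore networking, failures and conflict handling.

```python
class Primary:
    """Accepts all writes and records each one in an append-only log."""

    def __init__(self):
        self.data = {}
        self.log = []

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))


class Replica:
    """Rebuilds the primary's state by replaying log entries in order."""

    def __init__(self):
        self.data = {}
        self.applied = 0  # number of log entries already replayed

    def catch_up(self, log):
        for key, value in log[self.applied:]:
            self.data[key] = value
        self.applied = len(log)


primary = Primary()
replica = Replica()
primary.write("x", 1)
primary.write("y", 2)
replica.catch_up(primary.log)
primary.write("x", 3)  # not yet replayed on the replica
stale_x = replica.data["x"]
replica.catch_up(primary.log)
```

Between the two `catch_up` calls the replica serves a stale value for `x`, which is the trade-off asynchronous replication makes for read performance and availability.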
===Security===
[[Database security]] denotes the system, processes, and procedures that protect a database from unintended activity.
Security is usually enforced through '''access control''', '''auditing''', and '''encryption'''.
* Access control ensures and restricts who can connect and what can be done to the database.
* Auditing logs what action or change has been performed, when and by whom.
* Encryption: Since security has become a major issue in recent years, many commercial database vendors provide built-in encryption mechanisms. Data is encoded natively into the tables and deciphered "on the fly" when a query comes in. Connections can also be secured and encrypted if required using DSA, MD5, SSL or a legacy encryption standard.
Enforcing security is one of the major tasks of the DBA.
In the United Kingdom, legislation protecting the public from unauthorized disclosure of personal information held on databases falls under the Office of the Information Commissioner. United Kingdom based organizations holding personal data in electronic format (databases for example) are required to register with the Data Commissioner.
===Locking===
[[Lock (computer science)|Locking]] is how the database handles multiple concurrent operations. It is the means by which concurrency and a basic form of integrity are managed within the database system. Locks can be applied at the row level, or at other levels such as a page (a basic data block), an extent (an array of multiple pages) or even an entire table. This helps maintain the integrity of the data by ensuring that only one process at a time can modify the '''same''' data.
In a basic filesystem, only one lock at a time can be set on a file or folder, restricting its use to a single process. A database, by contrast, can set and hold multiple locks at the same time at different levels of the physical data structure. How locks are set, and how long they last, is determined by the database engine's locking scheme, based on the SQL or transactions submitted by users. Generally speaking, no activity on the database should translate into no locking, or very light locking.
For most DBMSs on the market, locks are generally '''shared''' or '''exclusive'''.
An exclusive lock means that no other lock can be acquired on the data object for as long as the exclusive lock lasts. Exclusive locks are usually set while the database needs to change data, as during an UPDATE or DELETE operation.
Shared locks can coexist with one another on the same data structure. Shared locks are usually used while the database is reading data, as during a SELECT operation.
The number and nature of the locks, and the length of time a lock is held on a data block, can have a huge impact on database performance. Bad locking can lead to disastrous performance (usually the result of poor SQL requests or an inadequate physical database structure).
Default locking behavior is enforced by the '''isolation level''' of the data server. Changing the isolation level will affect how shared or exclusive locks must be set on the data for the entire database system. The default isolation level is generally 1, where data cannot be read while it is being modified, which prevents "ghost data" from being returned to the end user.
At some point, intensive or inappropriate exclusive locking can lead to a "deadlock" between two locks, in which neither lock can be released because each is trying to acquire resources held by the other. The database has a fail-safe mechanism and will automatically "sacrifice" one of the locks, releasing the resource. When this happens, the processes or transactions involved in the deadlock are rolled back.
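The interaction between the two lock modes can be sketched as a small lock table. This assumes the conventional compatibility rule, in which shared locks are compatible with each other and an exclusive lock is compatible with nothing; the `LockTable` class and the transaction names are hypothetical, and a real engine would queue waiting requests rather than simply refuse them.

```python
# Conventional compatibility rule (an assumption of this sketch):
# (mode already held, mode requested) -> may both be held at once?
COMPATIBLE = {
    ("shared", "shared"): True,
    ("shared", "exclusive"): False,
    ("exclusive", "shared"): False,
    ("exclusive", "exclusive"): False,
}


class LockTable:
    def __init__(self):
        self.held = {}  # resource -> list of (owner, mode)

    def acquire(self, owner, resource, mode):
        """Grant the lock if it is compatible with every other holder."""
        holders = self.held.setdefault(resource, [])
        for other, other_mode in holders:
            if other != owner and not COMPATIBLE[(other_mode, mode)]:
                return False  # a real engine would make the caller wait
        holders.append((owner, mode))
        return True

    def release(self, owner, resource):
        """Drop every lock the owner holds on the resource."""
        self.held[resource] = [
            h for h in self.held.get(resource, []) if h[0] != owner
        ]
```

Two readers can hold `shared` locks on the same row simultaneously, while a writer asking for an `exclusive` lock is refused until both readers have released theirs.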
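One common way to detect such a situation is to look for a cycle in a "waits-for" graph. The sketch below assumes, for simplicity, that each transaction waits for at most one other transaction; the function name and the policy of sacrificing the last transaction in the cycle are illustrative choices, not the behaviour of any particular product.

```python
def find_deadlock(waits_for):
    """Return the transactions forming a waiting cycle, or None.

    waits_for maps a transaction to the single transaction it is waiting on.
    """
    for start in waits_for:
        path = [start]
        node = waits_for.get(start)
        # Follow the chain of waits until it ends or revisits a transaction.
        while node is not None and node not in path:
            path.append(node)
            node = waits_for.get(node)
        if node is not None:
            return path[path.index(node):]
    return None


def resolve(waits_for):
    """Pick a victim (hypothetical policy: last in the cycle) and roll it back."""
    cycle = find_deadlock(waits_for)
    if cycle is None:
        return None
    victim = cycle[-1]
    # Rolling back the victim releases its locks, so it no longer waits.
    del waits_for[victim]
    return victim
```

After the victim is removed from the graph, the remaining transaction can acquire the resource it was waiting for and proceed.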
Databases can also be locked for other reasons, like access restrictions for given levels of user.
Databases are also locked for routine database maintenance, which prevents changes being made during the maintenance. See [http://publib.boulder.ibm.com/infocenter/rbhelp/v6r3/index.jsp?topic=/com.ibm.redbrick.doc6.3/wag/wag80.htm IBM] for more detail.
===Architecture===
Depending on the intended use, there are a number of database architectures in use. Many databases use a combination of strategies.
On-line Transaction Processing systems (OLTP) often use a row-oriented datastore architecture, while data-warehouse and other retrieval-focused applications like [[Google]]'s [[BigTable]], or bibliographic database (library catalogue) systems, may use a column-oriented datastore architecture.
Document-oriented databases, XML databases, knowledge bases, frame databases and RDF stores (also known as triple stores) may also use a combination of these architectures in their implementation.
Finally, it should be noted that not all databases have or need a database 'schema' (so-called schema-less databases).
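The difference between the two layouts can be shown with ordinary Python containers. The sample sales data is invented, and real datastores add compression, paging and indexing on top of either layout.

```python
# Row-oriented layout: each record is stored together.
rows = [
    {"id": 1, "city": "Oslo", "sales": 10},
    {"id": 2, "city": "Lima", "sales": 25},
    {"id": 3, "city": "Oslo", "sales": 5},
]

# Column-oriented layout: all values of one attribute are stored together.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# Transaction-style access: fetch or update one whole record.
record = rows[1]

# Analytic-style access: scan a single attribute across all records.
total_sales = sum(columns["sales"])
```

Fetching `record` touches one contiguous entry in the row layout, whereas computing `total_sales` reads only the `sales` list in the column layout and never touches the other attributes.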
==Applications of databases==
Databases are used in many applications, spanning virtually the entire range of [[computer software]]. Databases are the preferred method of storage for large multiuser applications, where coordination between many users is needed. Even individual users find them convenient, and many electronic mail programs and personal organizers are based on standard database technology. Software database drivers are available for most database platforms so that [[application software]] can use a common [[Application Programming Interface]] to retrieve the information stored in a database. Two commonly used database APIs are [[Java Database Connectivity|JDBC]] and [[ODBC]].
For example, a suppliers database contains data relating to suppliers, such as:
*supplier name
*supplier code
*supplier address
It is often used by schools to teach students and grade them.
==Links to DBMS products==
*[[4th Dimension (Software)|4D]]
*[[ADABAS]]
*[[Alpha Five]]
*[[Apache Derby]] (Java, also known as IBM Cloudscape and Sun Java DB)
*[[BerkeleyDB]]
*[[CouchDB]]
*[[CSQL]]
*[[Datawasp]]
*[[Db4objects]]
*[[dBase]]
*[[FileMaker]]
*[[Firebird (database server)]]
*[[H2 (DBMS)|H2]] (Java)
*[[Hsqldb]] (Java)
*[[IBM DB2]]
*[[Information Management System|IBM IMS (Information Management System)]]
*[[IBM UniVerse]]
*[[Informix]]
*[[Ingres (database)|Ingres]]
*[[Interbase]]
*[[InterSystems Caché]]
*[[MaxDB]] (formerly SapDB)
*[[Microsoft Access]]
*[[Microsoft SQL Server]]
*[[Model 204]]
*[[MySQL]]
*[[Nomad software|Nomad]]
*[[Objectivity/DB]]
*[[ObjectStore]]
*[[Virtuoso Universal Server|OpenLink Virtuoso]]
*[[OpenOffice.org Base]]
*[[Oracle Database]]
*[[Paradox (database)]]
*[[Polyhedra DBMS]]
*[[PostgreSQL]]
*[[Progress 4GL]]
*[[RDM Embedded]]
*[[ScimoreDB]]
*[[Sedna (database)|Sedna]]
*[[SQLite]]
*[[Superbase database|Superbase]]
*[[Sybase]]
*[[Teradata]]
*[[Vertica]]
*[[Visual FoxPro]]
Cluster analysis
'''Clustering''' is the [[Statistical classification|classification]] of objects into different groups, or more precisely, the [[partition of a set|partitioning]] of a [[data set]] into [[subset]]s (clusters), so that the data in each subset (ideally) share some common trait - often proximity according to some defined [[metric (mathematics)|distance measure]]. Data clustering is a common technique for [[statistics|statistical]] [[data analysis]], which is used in many fields, including [[machine learning]], [[data mining]], [[pattern recognition]], [[image analysis]] and [[bioinformatics]]. The computational task of classifying the data set into ''k'' clusters is often referred to as '''''k''-clustering'''.
Besides the term ''data clustering'' (or just ''clustering''), there are a number of terms with similar meanings, including ''cluster analysis'', ''automatic classification'', ''numerical taxonomy'', ''botryology'' and ''typological analysis''.
== Types of clustering ==
Data clustering algorithms can be [[hierarchical]]. Hierarchical algorithms find successive clusters using previously established clusters. Hierarchical algorithms can be agglomerative ("bottom-up") or divisive ("top-down"). Agglomerative algorithms begin with each element as a separate cluster and merge them into successively larger clusters. Divisive algorithms begin with the whole set and proceed to divide it into successively smaller clusters.
[[partition of a set|Partitional]] algorithms typically determine all clusters at once, but can also be used as divisive algorithms in [[hierarchical]] clustering.
''Two-way clustering'', ''co-clustering'' or [[biclustering]] are clustering methods where not only the objects are clustered but also the features of the objects, i.e., if the data is represented in a [[data matrix (statistics)|data matrix]], the rows and columns are clustered simultaneously.
Another important distinction is whether the clustering uses symmetric or asymmetric distances. A property of [[Euclidean space]] is that distances are symmetric (the distance from object ''A'' to ''B'' is the same as the distance from ''B'' to ''A''). In other applications (e.g., sequence-alignment methods, see Prinzie & Van den Poel (2006)), this is not the case.
== Distance measure ==
An important step in any clustering is to select a [[Distance|distance measure]], which will determine how the ''similarity'' of two elements is calculated. This will influence the shape of the clusters, as some elements may be close to one another according to one distance and further away according to another. For example, in a 2-dimensional space, the distance between the point (x=1, y=0) and the origin (x=0, y=0) is always 1 according to the usual norms, but the distance between the point (x=1, y=1) and the origin can be 2, √2 or 1 if you take respectively the 1-norm, 2-norm or infinity-norm distance.
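The numbers in this example can be reproduced with three short functions. The function names are chosen for the illustration; the formulas are the standard 1-norm, 2-norm and infinity-norm distances.

```python
import math


def one_norm(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


def two_norm(p, q):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def infinity_norm(p, q):
    """Largest absolute coordinate difference."""
    return max(abs(a - b) for a, b in zip(p, q))


origin = (0, 0)
on_axis = [f((1, 0), origin) for f in (one_norm, two_norm, infinity_norm)]
diagonal = [f((1, 1), origin) for f in (one_norm, two_norm, infinity_norm)]
```

For the point on the axis all three distances agree, while for the diagonal point they are 2, about 1.414 and 1 respectively, so the choice of norm changes which points count as "close".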
Common distance functions:
* The [[Euclidean distance]] (also called distance [[as the crow flies]] or 2-norm distance). A review of cluster analysis in health psychology research found that the most common distance measure in published studies in that research area is the Euclidean distance or the squared Euclidean distance.
* The [[Manhattan distance]] (also called taxicab norm or 1-norm)
* The [[Maximum_norm|maximum norm]]
* The [[Mahalanobis distance]] corrects data for different scales and correlations in the variables
* The angle between two vectors can be used as a distance measure when clustering high dimensional data. See [[Inner product space]].
* The [[Hamming distance]] (sometimes edit distance) measures the minimum number of substitutions required to change one member into another.
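Two of the less geometric measures in this list, the Hamming distance and the angle between vectors, can be sketched as follows. The helper names are illustrative, and the angle function assumes that neither vector is the zero vector.

```python
import math


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def angle(u, v):
    """Angle in radians between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norms = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    # Clamp to [-1, 1] to guard against floating-point rounding.
    return math.acos(max(-1.0, min(1.0, dot / norms)))
```

The angle ignores vector length, so `(1, 0)` and `(2, 0)` are at distance zero from each other, which is often desirable for high-dimensional data such as word-count vectors.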
==Hierarchical clustering==
===Creating clusters===
Hierarchical clustering builds (agglomerative), or breaks up (divisive), a hierarchy of clusters. The traditional representation of this hierarchy is a [[tree data structure|tree]] (called a [[dendrogram]]), with individual elements at one end and a single cluster containing every element at the other. Agglomerative algorithms begin at the leaves of the tree, whereas divisive algorithms begin at the root. (In the figure, the arrows indicate an agglomerative clustering.)
Cutting the tree at a given height will give a clustering at a selected precision. In the following example, cutting after the second row will yield clusters {a} {b c} {d e} {f}. Cutting after the third row will yield clusters {a} {b c} {d e f}, which is a coarser clustering, with a smaller number of larger clusters.
===Agglomerative hierarchical clustering===
For example, suppose this data is to be clustered, and the [[euclidean distance]] is the [[Metric (mathematics)|distance metric]].
The hierarchical clustering [[dendrogram]] would be as such:
This method builds the hierarchy from the individual elements by progressively merging clusters. In our example, we have six elements {a} {b} {c} {d} {e} and {f}. The first step is to determine which elements to merge in a cluster. Usually, we want to take the two closest elements, according to the chosen distance.
Optionally, one can also construct a [[distance matrix]] at this stage, where the number in the ''i''-th row ''j''-th column is the distance between the ''i''-th and ''j''-th elements. Then, as clustering progresses, rows and columns are merged as the clusters are merged and the distances updated. This is a common way to implement this type of clustering, and has the benefit of caching distances between clusters. A simple agglomerative clustering algorithm is described in the [[single linkage clustering]] page; it can easily be adapted to different types of linkage (see below).
Suppose we have merged the two closest elements ''b'' and ''c''. We now have the clusters {''a''}, {''b'', ''c''}, {''d''}, {''e''} and {''f''}, and want to merge them further. To do that, we need the distance between {''a''} and {''b'', ''c''}, and must therefore define the distance between two clusters.
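Usually the distance between two clusters <math>\mathcal{A}</math> and <math>\mathcal{B}</math> is one of the following:
* The maximum distance between elements of each cluster (also called complete linkage clustering):
:: <math>\max \{\, d(x,y) : x \in \mathcal{A},\, y \in \mathcal{B} \,\}</math>
* The minimum distance between elements of each cluster (also called [[single linkage clustering]]):
:: <math>\min \{\, d(x,y) : x \in \mathcal{A},\, y \in \mathcal{B} \,\}</math>
* The mean distance between elements of each cluster (also called average linkage clustering, used e.g. in [[UPGMA]]):
:: <math>{1 \over |\mathcal{A}| \cdot |\mathcal{B}|} \sum_{x \in \mathcal{A}} \sum_{y \in \mathcal{B}} d(x,y)</math>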
* The sum of all intra-cluster variance
* The increase in variance for the cluster being merged ([[Ward's criterion]])
* The probability that candidate clusters spawn from the same distribution function (V-linkage)
Each agglomeration occurs at a greater distance between clusters than the previous agglomeration, and one can decide to stop clustering either when the clusters are too far apart to be merged (distance criterion) or when there is a sufficiently small number of clusters (number criterion).
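The merging procedure can be sketched as follows. This is a naive illustration rather than an efficient implementation: it recomputes cluster distances on every pass instead of maintaining a distance matrix, and the names <code>agglomerate</code>, <code>num_clusters</code> and <code>max_distance</code> are assumptions made for the example. The two stopping parameters correspond to the number criterion and the distance criterion.

```python
import math
from itertools import combinations

def single_linkage(dist, a, b):
    """Minimum distance between elements of the two clusters."""
    return min(dist(x, y) for x in a for y in b)

def complete_linkage(dist, a, b):
    """Maximum distance between elements of the two clusters."""
    return max(dist(x, y) for x in a for y in b)

def average_linkage(dist, a, b):
    """Mean distance between elements of the two clusters."""
    return sum(dist(x, y) for x in a for y in b) / (len(a) * len(b))

def agglomerate(points, dist, linkage, num_clusters=1, max_distance=math.inf):
    """Merge the two closest clusters until a stopping criterion is met."""
    clusters = [[p] for p in points]  # every element starts as its own cluster
    merges = []                       # record of (cluster, cluster, distance)
    while len(clusters) > num_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: linkage(dist, clusters[ij[0]], clusters[ij[1]]))
        gap = linkage(dist, clusters[i], clusters[j])
        if gap > max_distance:        # distance criterion: clusters too far apart
            break
        merges.append((clusters[i], clusters[j], gap))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters, merges
```

Passing a different linkage function changes only how the distance between two clusters is measured; the merge loop itself stays the same.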
=== Concept clustering ===
Another variation of the agglomerative clustering approach is [[conceptual clustering]].
==Partitional clustering==
===''K''-means and derivatives===
====''K''-means clustering====
The [[K-means algorithm|''K''-means algorithm]] assigns each point to the cluster whose center (also called centroid) is nearest. The center is the average of all the points in the cluster — that is, its coordinates are the arithmetic mean for each dimension separately over all the points in the cluster...
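:''Example:'' The data set has three dimensions and the cluster has two points: ''X'' = (''x''<sub>1</sub>, ''x''<sub>2</sub>, ''x''<sub>3</sub>) and ''Y'' = (''y''<sub>1</sub>, ''y''<sub>2</sub>, ''y''<sub>3</sub>). Then the centroid ''Z'' becomes ''Z'' = (''z''<sub>1</sub>, ''z''<sub>2</sub>, ''z''<sub>3</sub>), where ''z''<sub>1</sub> = (''x''<sub>1</sub> + ''y''<sub>1</sub>)/2 and ''z''<sub>2</sub> = (''x''<sub>2</sub> + ''y''<sub>2</sub>)/2 and ''z''<sub>3</sub> = (''x''<sub>3</sub> + ''y''<sub>3</sub>)/2.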
The algorithm steps are (J. MacQueen, 1967):
* Choose the number of clusters, ''k''.
* Randomly generate ''k'' clusters and determine the cluster centers, or directly generate ''k'' random points as cluster centers.
* Assign each point to the nearest cluster center.
* Recompute the new cluster centers.
* Repeat the two previous steps until some convergence criterion is met (usually that the assignment hasn't changed).
The main advantages of this algorithm are its simplicity and speed, which allow it to run on large datasets. Its disadvantage is that it does not yield the same result with each run, since the resulting clusters depend on the initial random assignments. It minimizes intra-cluster variance, but does not ensure that the result is a global minimum of variance.
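The steps listed above can be sketched in plain Python. This is a minimal illustration, assuming points are tuples of numbers and that the initial centers are ''k'' points drawn at random from the data; the <code>seed</code> and <code>max_iter</code> parameters are additions for reproducibility and safety, not part of the algorithm as described.

```python
import random

def mean(points):
    """Arithmetic mean of a list of points, taken per dimension."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def sq_dist(p, q):
    """Squared Euclidean distance (enough for finding the nearest center)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centers = rng.sample(points, k)   # k random data points as initial centers
    assignment = None
    for _ in range(max_iter):
        # Assign each point to the nearest cluster center.
        new_assignment = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                          for p in points]
        if new_assignment == assignment:   # converged: nothing changed
            break
        assignment = new_assignment
        # Recompute each center as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                    # an empty cluster keeps its old center
                centers[c] = mean(members)
    return centers, assignment
```

Running the function with different <code>seed</code> values illustrates the dependence on the initial choice: on data without well-separated groups, the returned clusters can differ from run to run.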
====Fuzzy ''c''-means clustering====
In [[fuzzy clustering]], each point has a degree of belonging to clusters, as in [[fuzzy logic]], rather than belonging completely to just one cluster. Thus, points on the edge of a cluster may be ''in the cluster'' to a lesser degree than points in the center of the cluster. For each point ''x'' we have a coefficient <math>u_k(x)</math> giving the degree of being in the ''k''th cluster. Usually, the sum of those coefficients is defined to be 1:
: <math>\forall x \quad \sum_{k} u_k(x) = 1.</math>
With fuzzy ''c''-means, the centroid of a cluster is the mean of all points, weighted by their degree of belonging to the cluster:
: <math>\mathrm{center}_k = {{\sum_x u_k(x)^m x} \over {\sum_x u_k(x)^m}}.</math>
The degree of belonging is related to the inverse of the distance to the cluster
: <math>u_k(x) = {1 \over d(\mathrm{center}_k,x)},</math>
then the coefficients are normalized and fuzzyfied with a real parameter <math>m > 1</math> so that their sum is 1. So
: <math>u_k(x) = \frac{1}{\sum_j \left(\frac{d(\mathrm{center}_k,x)}{d(\mathrm{center}_j,x)}\right)^{2/(m-1)}}.</math>
For ''m'' equal to 2, this is equivalent to normalising the coefficient linearly to make their sum 1. When ''m'' is close to 1, the cluster center closest to the point is given much more weight than the others, and the algorithm is similar to ''k''-means.
The fuzzy ''c''-means algorithm is very similar to the ''k''-means algorithm:
* Choose a number of clusters.
* Assign randomly to each point coefficients for being in the clusters.
* Repeat until the algorithm has converged (that is, the coefficients' change between two iterations is no more than <math>\varepsilon</math>, the given sensitivity threshold):
** Compute the centroid for each cluster, using the formula above.
** For each point, compute its coefficients of being in the clusters, using the formula above.
The algorithm minimizes intra-cluster variance as well, but has the same problems as ''k''-means: the minimum is a local minimum, and the results depend on the initial choice of weights.
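A sketch of the fuzzy ''c''-means loop is given below. It assumes the standard update rules: centroids weighted by the membership coefficients raised to the power ''m'', and memberships proportional to inverse distance ratios raised to the power 2/(''m'' − 1). The parameter names (<code>eps</code>, <code>seed</code>, <code>max_iter</code>) and the handling of a point that coincides with a center are choices made for this example.

```python
import math
import random

def fuzzy_c_means(points, c, m=2.0, eps=1e-6, seed=0, max_iter=300):
    rng = random.Random(seed)
    dim = len(points[0])
    # Assign random coefficients to each point, normalized to sum to 1.
    u = []
    for _ in points:
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        u.append([w / total for w in row])
    centers = []
    for _ in range(max_iter):
        # Centroid of each cluster: mean of all points weighted by u ** m.
        centers = []
        for k in range(c):
            weights = [row[k] ** m for row in u]
            total = sum(weights)
            centers.append(tuple(
                sum(w * p[d] for w, p in zip(weights, points)) / total
                for d in range(dim)))
        # Recompute the coefficients of every point from its distances.
        new_u = []
        for p in points:
            dists = [math.dist(p, ctr) for ctr in centers]
            if min(dists) == 0.0:
                # Assumption: a point sitting on a center belongs fully to it.
                row = [1.0 if d == 0.0 else 0.0 for d in dists]
                total = sum(row)
                row = [r / total for r in row]
            else:
                row = [1.0 / sum((dk / dj) ** (2.0 / (m - 1.0)) for dj in dists)
                       for dk in dists]
            new_u.append(row)
        change = max(abs(a - b) for old, new in zip(u, new_u)
                     for a, b in zip(old, new))
        u = new_u
        if change <= eps:   # converged within the sensitivity threshold
            break
    return centers, u
```

Each row of the returned coefficient table sums to 1; taking the largest coefficient in a row gives a hard assignment comparable to that of ''k''-means.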
The [[Expectation-maximization algorithm]] is a more statistically formalized method which includes some of these ideas: partial membership in classes. It has better convergence properties and is in general preferred to fuzzy-c-means.
====QT clustering algorithm====
QT (quality threshold) clustering (Heyer et al, 1999) is an alternative method of partitioning data, invented for gene clustering. It requires more computing power than ''k''-means, but does not require specifying the number of clusters ''a priori'', and always returns the same result when run several times.
The algorithm is:
* The user chooses a maximum diameter for clusters.
* Build a candidate cluster for each point by including the closest point, the next closest, and so on, until the diameter of the cluster surpasses the threshold.
* Save the candidate cluster with the most points as the first true cluster, and remove all points in the cluster from further consideration.
* [[Recursion|Recurse]] with the reduced set of points.
The distance between a point and a group of points is computed using complete linkage, i.e. as the maximum distance from the point to any member of the group (see the "Agglomerative hierarchical clustering" section about distance between clusters).
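The QT procedure can be sketched as follows, written as a loop rather than a recursion. This is an illustration under two assumptions that the description above leaves open: a candidate stops growing just before its diameter would exceed the threshold, and when several candidates have the same maximal size the first one found is kept.

```python
def qt_cluster(points, max_diameter, dist):
    """Quality-threshold clustering over a list of points (naive sketch)."""
    remaining = list(range(len(points)))   # indices not yet clustered
    clusters = []
    while remaining:
        best = []
        for seed in remaining:
            # Build a candidate cluster around each remaining point.
            candidate = [seed]
            others = [i for i in remaining if i != seed]

            def reach(i):
                # Complete linkage: maximum distance to any current member.
                return max(dist(points[i], points[j]) for j in candidate)

            while others:
                nearest = min(others, key=reach)
                if reach(nearest) > max_diameter:
                    break                  # adding it would exceed the diameter
                candidate.append(nearest)
                others.remove(nearest)
            if len(candidate) > len(best):  # ties: keep the first candidate found
                best = candidate
        clusters.append([points[i] for i in best])
        chosen = set(best)
        remaining = [i for i in remaining if i not in chosen]
    return clusters
```

Because nothing in the procedure is random, repeated runs on the same input return the same clusters, at the cost of rebuilding a candidate for every remaining point in each round.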
=== Locality-sensitive hashing ===
[[Locality-sensitive hashing]] can be used for clustering. Feature space vectors are sets, and the metric used is the [[Jaccard distance]]. The feature space can be considered high-dimensional. The ''min-wise independent permutations'' LSH scheme (sometimes MinHash) is then used to put similar items into buckets. With just one set of hashing methods, there are only clusters of very similar elements. By seeding the hash functions several times (e.g. 20), it is possible to get bigger clusters.
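The idea can be illustrated with a small MinHash sketch. It is a simplification: salted CRC-32 checksums stand in for min-wise independent permutations, items are bucketed by their whole signature (a single set of hash functions), and all function names are chosen for this example. Sets must be non-empty.

```python
import zlib
from collections import defaultdict

def jaccard_distance(a, b):
    """1 minus the ratio of intersection size to union size."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def minhash_signature(items, num_hashes=20):
    """Smallest salted hash value of the set under each of num_hashes seeds."""
    return tuple(
        min(zlib.crc32(f"{seed}:{item}".encode()) for item in items)
        for seed in range(num_hashes))

def estimated_similarity(sig_a, sig_b):
    """Fraction of matching positions, an estimate of Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

def bucket_by_signature(named_sets, num_hashes=2):
    """Group set names whose signatures collide."""
    buckets = defaultdict(list)
    for name, items in named_sets.items():
        buckets[minhash_signature(items, num_hashes)].append(name)
    return list(buckets.values())
```

Fewer hash functions per signature make collisions between merely similar sets more likely, which trades precision for larger buckets.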
=== Graph-theoretic methods ===
[[Formal concept analysis]] is a technique for generating clusters of objects and attributes, given a [[bipartite graph]] representing the relations between the objects and attributes. Other methods for generating ''overlapping clusters'' (a [[Cover (topology)|cover]] rather than a [[partition of a set|partition]]) are discussed by Jardine and Sibson (1968) and Cole and Wishart (1970).
== Elbow criterion ==
The elbow criterion is a common [[rule of thumb]] to determine what number of clusters should be chosen, for example for ''k''-means and agglomerative hierarchical clustering. It should also be noted that the initial assignment of cluster seeds has bearing on the final model performance. Thus, it is appropriate to re-run the cluster analysis multiple times.
The elbow criterion says that you should choose a number of clusters so that adding another cluster doesn't add sufficient information. More precisely, if you graph the percentage of variance explained by the clusters against the number of clusters, the first clusters will add much information (explain a lot of variance), but at some point the marginal gain will drop, giving an angle in the graph (the elbow). This elbow cannot always be unambiguously identified. Percentage of variance explained is the ratio of the between-group variance to the total variance.
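The quantity plotted for the elbow criterion can be computed as sketched below. The function takes points as tuples together with one cluster label per point, and returns the ratio of the between-group sum of squares to the total sum of squares; the function name is an assumption made for this example.

```python
def explained_variance(points, labels):
    """Between-group variance divided by total variance for a clustering."""
    n = len(points)
    dim = len(points[0])
    grand_mean = [sum(p[d] for p in points) / n for d in range(dim)]
    total = sum((p[d] - grand_mean[d]) ** 2 for p in points for d in range(dim))
    if total == 0:
        return 0.0          # all points identical: nothing to explain
    groups = {}
    for p, label in zip(points, labels):
        groups.setdefault(label, []).append(p)
    between = 0.0
    for members in groups.values():
        center = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        # Each cluster contributes its size times its squared offset from the grand mean.
        between += len(members) * sum(
            (center[d] - grand_mean[d]) ** 2 for d in range(dim))
    return between / total
```

Evaluating this ratio for clusterings with 1, 2, 3, ... clusters and plotting it against the number of clusters gives the curve in which the elbow is sought.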
On the following graph, the elbow is indicated by the red circle. The number of clusters chosen should therefore be 4.
== Spectral clustering ==
Given a set of data points <math>A</math>, the [[similarity matrix]] may be defined as a matrix <math>S</math> where <math>S_{ij}</math> represents a measure of the similarity between points <math>i, j \in A</math>. Spectral clustering techniques make use of the [[Spectrum of a matrix|spectrum]] of the similarity matrix of the data to perform [[dimensionality reduction]] for clustering in fewer dimensions.
One such technique is the ''[[Shi-Malik algorithm]]'', commonly used for [[segmentation (image processing)|image segmentation]]. It partitions points into two sets <math>(B_1, B_2)</math> based on the [[eigenvector]] <math>v</math> corresponding to the second-smallest [[eigenvalue]] of the [[Laplacian matrix]]
: <math>L = I - D^{-1/2} S D^{-1/2}</math>
of <math>S</math>, where <math>D</math> is the diagonal matrix
: <math>D_{ii} = \sum_{j} S_{ij}.</math>
This partitioning may be done in various ways, such as by taking the median <math>m</math> of the components in <math>v</math>, and placing all points whose component in <math>v</math> is greater than <math>m</math> in <math>B_1</math>, and the rest in <math>B_2</math>. The algorithm can be used for hierarchical clustering by repeatedly partitioning the subsets in this fashion.
A related algorithm is the ''[[Meila-Shi algorithm]]'', which takes the [[eigenvector]]s corresponding to the ''k'' largest [[eigenvalue]]s of the matrix <math>P = S D^{-1}</math> for some ''k'', and then invokes another algorithm (e.g. ''k''-means) to cluster points by their respective ''k'' components in these eigenvectors.
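The two-way split used by the Shi-Malik algorithm can be sketched with NumPy. The sketch assumes a symmetric similarity matrix with positive row sums and uses the symmetric normalized Laplacian, that is, the identity minus the similarity matrix scaled on both sides by the inverse square roots of its row sums; the function name is chosen for this example.

```python
import numpy as np

def shi_malik_bipartition(S):
    """Split points in two using the second-smallest Laplacian eigenvector."""
    S = np.asarray(S, dtype=float)
    degree = S.sum(axis=1)                       # diagonal of D
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    # L = I - D^(-1/2) S D^(-1/2)
    L = np.eye(len(S)) - d_inv_sqrt[:, None] * S * d_inv_sqrt[None, :]
    # eigh returns eigenvalues of a symmetric matrix in ascending order.
    _, eigenvectors = np.linalg.eigh(L)
    v = eigenvectors[:, 1]                       # second-smallest eigenvalue
    m = np.median(v)
    return v > m                                 # True: first set, False: second set
```

The sign of an eigenvector is arbitrary, so which group receives <code>True</code> can differ between platforms; only the split itself is meaningful. Applying the function again to the similarity sub-matrix of each group gives the hierarchical variant.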
==Applications==
=== Biology ===
In [[biology]] '''clustering''' has many applications:
*In imaging, data clustering may take different form based on the data dimensionality. For example, the [http://wiki.stat.ucla.edu/socr/index.php/SOCR_EduMaterials_Activities_2D_PointSegmentation_EM_Mixture SOCR EM Mixture model segmentation activity and applet] shows how to obtain point, region or volume classification using the online [[SOCR]] computational libraries.
*In the fields of [[plant]] and [[animal]] [[ecology]], clustering is used to describe and to make spatial and temporal comparisons of communities (assemblages) of organisms in heterogeneous environments; it is also used in [[Systematics|plant systematics]] to generate artificial [[Phylogeny|phylogenies]] or clusters of organisms (individuals) at the species, genus or higher level that share a number of attributes
*In computational biology and [[bioinformatics]]:
** In [[transcriptome|transcriptomics]], clustering is used to build groups of [[genes]] with related expression patterns (also known as coexpressed genes). Often such groups contain functionally related proteins, such as [[enzyme]]s for a specific [[metabolic pathway|pathway]], or genes that are co-regulated. High throughput experiments using [[expressed sequence tag]]s (ESTs) or [[DNA microarray]]s can be a powerful tool for [[genome annotation]], a general aspect of [[genomics]].
** In [[sequence analysis]], clustering is used to group homologous sequences into [[list of gene families|gene families]]. This is a very important concept in bioinformatics, and [[evolutionary biology]] in general. See evolution by [[gene duplication]].
** In high-throughput genotyping platforms clustering algorithms are used to automatically assign [[genotypes]].
=== Medicine ===
In [[medical imaging]], such as [[PET scan|PET scans]], cluster analysis can be used to differentiate between different types of [[tissue (biology)|tissue]] and [[blood]] in a three dimensional image. In this application, actual position does not matter, but the [[voxel]] intensity is considered as a [[coordinate vector|vector]], with a dimension for each image that was taken over time. This technique allows, for example, accurate measurement of the rate a radioactive tracer is delivered to the area of interest, without a separate sampling of [[arterial]] blood, an intrusive technique that is most common today.
=== Market research ===
Cluster analysis is widely used in [[market research]] when working with multivariate data from [[Statistical survey|surveys]] and test panels. Market researchers use cluster analysis to partition the general [[population]] of [[consumers]] into market segments and to better understand the relationships between different groups of consumers/potential [[customers]].
* Segmenting the market and determining [[target market]]s
* [[positioning (marketing)|Product positioning]]
* [[New product development]]
* Selecting test markets (see : [[experimental techniques]])
=== Other applications ===
'''Social network analysis''': In the study of [[social networks]], clustering may be used to recognize [[communities]] within large groups of people.
'''Image segmentation''': Clustering can be used to divide a [[digital]] [[image]] into distinct regions for [[border detection]] or [[object recognition]].
'''Data mining''': Many [[data mining]] applications involve partitioning data items into related subsets; the marketing applications discussed above represent some examples. Another common application is the division of documents, such as [[World Wide Web]] pages, into genres.
'''Search result grouping''': In the process of intelligent grouping of the files and websites, clustering may be used to create a more relevant set of search results compared to normal search engines like [[Google]]. There are currently a number of web based clustering tools such as [[Clusty]].
'''Slippy map optimization''': [[Flickr]]'s map of photos and other map sites use clustering to reduce the number of markers on a map. This makes the map faster and reduces the amount of visual clutter.
'''IMRT segmentation''': Clustering can be used to divide a fluence map into distinct regions for conversion into deliverable fields in MLC-based Radiation Therapy.
'''Grouping of Shopping Items''': Clustering can be used to group all the shopping items available on the web into a set of unique products. For example, all the items on eBay can be grouped into unique products. (eBay doesn't have the concept of a SKU)
'''[[Mathematical chemistry]]''': Clustering is used to find structural similarity, among other things; for example, 3000 chemical compounds were clustered in the space of 90 [[topological index|topological indices]].
'''Petroleum Geology''': Cluster Analysis is used to reconstruct missing bottom hole core data or missing log curves in order to evaluate reservoir properties.
== Comparisons between data clusterings ==
There have been several suggestions for a measure of similarity between two clusterings. Such a measure can be used to compare how well different data clustering algorithms perform on a set of data.
Many of these measures are derived from the [[matching matrix]] (aka [[confusion matrix]]), e.g., the [[Rand index|Rand measure]] and the Fowlkes-Mallows ''B''<sub>''k''</sub> measures.
[[Marina Meila]]'s Variation of Information metric is a more recent approach for measuring distance between clusterings. It uses [[Mutual information|mutual information]] and [[entropy]] to approximate the distance between two clusterings across the lattice of possible clusterings.
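Two of the measures mentioned above can be computed directly from label lists. The sketch below assumes each clustering is given as one label per data point, in the same order; the Rand measure is the fraction of point pairs on which the two clusterings agree, and the variation of information is taken as the sum of the two entropies minus twice the mutual information.

```python
import math
from collections import Counter
from itertools import combinations

def rand_index(a, b):
    """Fraction of point pairs grouped consistently by both clusterings."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)

def entropy(labels):
    """Entropy (in nats) of the cluster-size distribution."""
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())

def mutual_information(a, b):
    """Mutual information (in nats) between two clusterings."""
    n = len(a)
    count_a, count_b, joint = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(c / n * math.log(c * n / (count_a[x] * count_b[y]))
               for (x, y), c in joint.items())

def variation_of_information(a, b):
    return entropy(a) + entropy(b) - 2 * mutual_information(a, b)
```

Both measures ignore the label names themselves: renaming the clusters of one clustering leaves the Rand measure at 1 and the variation of information at 0.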
==Algorithms==
In recent years considerable effort has been put into improving algorithm performance (Z. Huang, 1998). Among the most popular algorithms are ''CLARANS'' (Ng and Han, 1994), ''[[DBSCAN]]'' (Ester et al., 1996) and ''BIRCH'' (Zhang et al., 1996).
Data mining
'''Data mining''' is the process of [[sorting]] through large amounts of data and picking out relevant information. It is usually used by [[business intelligence]] organizations, and [[financial analyst]]s, but is increasingly being used in the sciences to extract information from the enormous [[data set]]s generated by modern experimental and observational methods. It has been described as "the nontrivial extraction of implicit, previously unknown, and potentially useful [[information]] from [[data]]" and "the science of extracting useful information from large [[data set]]s or [[database]]s." Data mining in relation to [[enterprise resource planning]] is the statistical and logical analysis of large sets of transaction data, looking for patterns that can aid decision making.
==Background==
Traditionally, business analysts have performed the task of extracting useful [[information]] from recorded [[data]], but the increasing volume of data in modern business and science calls for computer-based approaches. As [[data set]]s have grown in size and complexity, there has been a shift away from direct hands-on data analysis toward indirect, automatic data analysis using more complex and sophisticated tools. The modern technologies of [[computers]], [[networks]], and [[sensors]] have made [[data collection]] and organization much easier. However, the captured data needs to be converted into [[information]] and [[knowledge]] to become useful. Data mining is the entire process of applying computer-based [[methodology]], including new techniques for [[knowledge discovery]], to data.
Data mining identifies trends within data that go beyond simple analysis. Through the use of sophisticated algorithms, non-statistician users have the opportunity to identify key attributes of business processes and target opportunities. However, abdicating control of this process from the statistician to the machine may result in false-positives or no useful results at all.
Although data mining is a relatively new term, the technology is not. For many years, businesses have used powerful computers to sift through volumes of data such as supermarket scanner data to produce market research reports (although reporting is not considered to be data mining). Continuous innovations in computer processing power, disk storage, and statistical software are dramatically increasing the accuracy and usefulness of data analysis. Web 2.0 technologies have generated a colossal amount of user-generated data and media, making it hard to aggregate and consume information in a meaningful way without getting overloaded. Given the size of the data on the Internet, and the difficulty in contextualizing it, it is unclear whether the traditional approach to data mining is computationally viable.
The term data mining is often used to apply to the two separate processes of knowledge discovery and [[prediction]]. Knowledge discovery provides explicit information that has a readable form and can be understood by a user. [[Forecasting]], or [[predictive modeling]] provides predictions of future events and may be transparent and readable in some approaches (e.g., rule-based systems) and opaque in others such as [[neural network]]s. Moreover, some data-mining systems such as neural networks are inherently geared towards prediction and pattern recognition, rather than knowledge discovery. [[Metadata]], or data about a given data set, are often expressed in a condensed ''data-minable'' format, or one that facilitates the practice of data mining. Common examples include executive summaries and scientific abstracts.
Data mining relies on the use of real world data. This data is extremely vulnerable to [[collinearity]] precisely because data from the real world may have unknown interrelations. An unavoidable weakness of data mining is that the critical data that may expose any relationship might have never been observed. Alternative approaches using an experiment-based approach such as [[Choice Modelling]] for human-generated data may be used. Inherent correlations are either controlled for or removed altogether through the construction of an [[experimental design]].
Recently, there were some efforts to define a standard for data mining, for example the [[CRISP-DM]] standard for analysis processes or the [[Java Data-Mining]] Standard. Independent of these standardization efforts, freely available open-source software systems like [[RapidMiner]] and [[Weka (machine learning)| Weka]] have become an informal standard for defining data-mining processes.
==Privacy concerns==
There are also [[privacy]] and [[human rights]] concerns associated with data mining, specifically regarding the source of the data analyzed. Data mining provides information that may be difficult to obtain otherwise. When the data collected involves individual people, there are many questions concerning privacy, legality, and ethics. In particular, data mining government or commercial data sets for national security or law enforcement purposes has raised privacy concerns.
==Notable uses of data mining==
===Combatting Terrorism===
Data mining has been cited as the method by which the U.S. Army unit [[Able Danger]] had identified the [[September 11, 2001 attacks]] leader, [[Mohamed Atta]], and three other 9/11 hijackers as possible members of an [[Al Qaeda]] cell operating in the U.S. more than a year before the attack.
It has been suggested that both the [[Central Intelligence Agency]] and the [[Canadian Security Intelligence Service]] have employed this method.
Previous US government data mining programs intended to stop terrorism include the Terrorism Information Awareness (TIA) program, Computer-Assisted Passenger Prescreening System (CAPPS II), Analysis, Dissemination, Visualization, Insight, and Semantic Enhancement (ADVISE), Multistate Anti-Terrorism Information Exchange (MATRIX), and the Secure Flight program [http://www.msnbc.msn.com/id/20604775/ Security-MSNBC]. These programs have been discontinued due to controversy over whether they violate the US Constitution's 4th amendment.
===Games===
Since the early 1960s, the availability of [[Oracle machine|oracle]]s for certain [[combinatorial game]]s, also called [[tablebase]]s (e.g. for 3x3-chess with any beginning configuration, small-board [[dots-and-boxes]], small-board hex, and certain endgames in chess, dots-and-boxes, and hex), has opened up a new area for data mining: the extraction of human-usable strategies from these oracles. Current pattern recognition approaches do not seem to have the high level of abstraction required to be applied successfully. Instead, extensive experimentation with the tablebases, combined with an intensive study of tablebase answers to well-designed problems and with knowledge of prior art, i.e. pre-tablebase knowledge, is used to yield insightful patterns. [[Berlekamp]] in dots-and-boxes etc. and [[John Nunn]] in [[chess]] [[Chess endgame|endgames]] are notable examples of researchers doing this work, though they were not and are not involved in tablebase generation.
===Business===
Data mining in [[customer relationship management]] applications can contribute significantly to the bottom line. Rather than contacting a prospect or customer through a call center or sending mail, only prospects that are predicted to have a high likelihood of responding to an offer are contacted. More sophisticated methods may be used to optimize across campaigns so that we can predict which channel and which offer an individual is most likely to respond to - across all potential offers. Finally, in cases where many people will take an action without an offer, uplift modeling can be used to determine which people will have the greatest increase in responding if given an offer. [[Data clustering]] can also be used to automatically discover the segments or groups within a customer data set.
Businesses employing data mining quickly see a return on investment, but they also recognize that the number of predictive models can quickly become very large. Rather than one model to predict which customers will [[Churning (stock trade)|churn]], a business could build a separate model for each region and customer type. Then instead of sending an offer to all people that are likely to churn, it may only want to send offers to customers that will likely take the offer. And finally, it may also want to determine which customers are going to be profitable over a window of time and only send the offers to those that are likely to be profitable. In order to maintain this quantity of models, they need to manage model versions and move to ''automated data mining''.
Data mining can also be helpful to human-resources departments in identifying the characteristics of their most successful employees. Information obtained, such as universities attended by highly successful employees, can help HR focus recruiting efforts accordingly. Additionally, Strategic Enterprise Management applications help a company translate corporate-level goals, such as profit and margin share targets, into operational decisions, such as production plans and workforce levels.
Another example of data mining, often called [[market basket analysis]], relates to its use in retail sales. If a clothing store records the purchases of customers, a data-mining system could identify those customers who favour silk shirts over cotton ones. Although some relationships may be difficult to explain, taking advantage of them is easier. The example deals with [[association rule]]s within transaction-based data. Not all data are transaction based, and logical or inexact [[rule]]s may also be present within a [[database]]. In a manufacturing application, an inexact rule may state that 73% of products which have a specific defect or problem will develop a secondary problem within the next six months.
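The strength of an association rule in transaction data is commonly summarised by its support and confidence, which can be sketched as follows. Transactions are assumed to be Python sets of item names, and the function names are chosen for this example. A figure such as the 73% in the manufacturing rule above corresponds to a confidence of 0.73.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the share that also
    contain the consequent. Assumes the antecedent occurs at least once."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)
```

For a hypothetical store with four recorded baskets, two of which contain both a silk shirt and a tie and a third a silk shirt alone, the rule "silk shirt implies tie" has support 0.5 and confidence 2/3.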
Related to an integrated-circuit production line, an example of data mining is described in the paper "Mining IC Test Data to Optimize VLSI Testing." In this paper the application of data mining and decision analysis to the problem of die-level functional test is described. Experiments mentioned in this paper demonstrate the ability of applying a system of mining historical die-test data to create a probabilistic model of patterns of die failure which are then utilized to decide in real time which die to test next and when to stop testing. This system has been shown, based on experiments with historical test data, to have the potential to improve profits on mature IC products.
===Science and engineering===
In recent years, data mining has been widely used in areas of science and engineering, such as [[bioinformatic]]s, [[genetic]]s, [[medicine]], [[education]], and [[electrical power]] engineering.
In the area of study on human genetics, the important goal is to understand the mapping relationship between the inter-individual variation in human [[DNA]] sequences and variability in disease susceptibility. In lay terms, it is to find out how the changes in an individual's DNA sequence affect the risk of developing common diseases such as [[cancer]]. This is very important to help improve the diagnosis, prevention and treatment of the diseases. The data mining technique that is used to perform this task is known as [[multifactor dimensionality reduction]].
In the area of electrical power engineering, data mining techniques have been widely used for [[condition monitoring]] of high voltage electrical equipment. The purpose of condition monitoring is to obtain valuable information on the health status of the equipment's [[insulation]]. [[Data clustering]] such as [[self-organizing map]] (SOM) has been applied to the vibration monitoring and analysis of transformer on-load tap-changers (OLTCs). Using vibration monitoring, it can be observed that each tap change operation generates a signal that contains information about the condition of the tap changer contacts and the drive mechanisms. Different tap positions will generate different signals. However, there was considerable variability amongst normal condition signals for the exact same tap position. SOM has been applied to detect abnormal conditions and to estimate the nature of the abnormalities.
Data mining techniques have also been applied for [[dissolved gas analysis]] (DGA) on [[power transformer]]s. DGA, as a diagnostic for power transformers, has been available for many years. Data mining techniques such as SOM have been applied to analyse data and to determine trends which are not obvious to the standard DGA ratio techniques such as the Duval Triangle.
A fourth area of application for data mining in science/engineering is within educational research, where data mining has been used to study the factors leading students to choose to engage in behaviors which reduce their learning and to understand the factors influencing university student retention.
Other examples of the application of data mining techniques are [[biomedical]] data facilitated by domain ontologies, mining clinical trial data, [[traffic analysis]] using SOM, et cetera.
Data set
A '''data set''' (or '''dataset''') is a collection of [[data]], usually presented in tabular form. Each column represents a particular variable. Each row corresponds to a given member of the data set in question. It lists values for each of the variables, such as height and weight of an object or values of random numbers. Each value is known as a [[datum]]. The data set may comprise data for one or more members, corresponding to the number of rows.
Historically, the term originated in the [[mainframe computer|mainframe field]], where it had a [[Data set (IBM mainframe)|well-defined meaning]], very close to contemporary ''[[computer file]]''. This topic is not covered here.
In the simplest case, there is only one variable, and then the data set consists of a single column of values, often represented as a list.
The values may be numbers, such as [[real number]]s or [[integer]]s, for example representing a person's height in centimeters, but may also be [[nominal data]] (i.e., not consisting of [[numerical]] values), for example representing a person's ethnicity. More generally, values may be of any of the kinds described as a [[level of measurement]]. For each variable, the values will normally all be of the same kind. However, there may also be "[[missing values]]", which need to be indicated in some way.
In [[statistics]] data sets usually come from actual observations obtained by [[sampling (statistics)|sampling]] a [[statistical population]], and each row corresponds to the observations on one element of that population. Data sets may further be generated by [[algorithms]] for the purpose of testing certain kinds of [[software]]. Some modern statistical analysis software such as [[PSPP]] still present their data in the classical dataset fashion.
== Classic data sets ==
Several classic [[data set]]s have been used extensively in the [[statistical]] literature:
* [[Iris flower data set]] - multivariate data set introduced by [[Ronald Fisher]] (1936).
* ''[[Categorical data analysis]]'' - Data sets used in the book, ''An Introduction to Categorical Data Analysis'', by Agresti are [http://lib.stat.cmu.edu/datasets/agresti provided on-line by StatLib.]
*''[[Robust statistics]]'' - Data sets used in ''Robust Regression and Outlier Detection'' (Rousseeuw and Leroy, 1986). [http://www.uni-koeln.de/themen/Statistik/data/rousseeuw/ Provided on-line at the University of Cologne.]
*''[[Time series]]'' - Data used in Chatfield's book, ''The Analysis of Time Series'', are [http://lib.stat.cmu.edu/modules.php?op=modload&name=PostWrap&file=index&page=datasets/ provided on-line by StatLib.]
*''Extreme values'' - Data used in the book, ''An Introduction to the Statistical Modeling of Extreme Values'' are [http://homes.stat.unipd.it/coles/public_html/ismev/ismev.dat provided on-line by Stuart Coles], the book's author.
*''Bayesian Data Analysis'' - Data used in the book, ''[[Bayesian]] Data Analysis'', are [http://www.stat.columbia.edu/~gelman/book/data/ provided on-line by Andrew Gelman], one of the book's authors.
* The [ftp://ftp.ics.uci.edu/pub/machine-learning-databases/liver-disorders Bupa liver data], used in several papers in the machine learning (data mining) literature.
ELIZA
'''ELIZA''' is a [[computer program]] by [[Joseph Weizenbaum]], designed in [[1966]], which parodied a [[Rogerian psychotherapy|Rogerian therapist]], largely by rephrasing many of the patient's statements as questions and posing them to the patient. Thus, for example, the response to "My head hurts" might be "Why do you say your head hurts?" The response to "My mother hates me" might be "Who else in your family hates you?" ELIZA was named after Eliza Doolittle, a working-class character in [[George Bernard Shaw|George Bernard Shaw's]] play ''[[Pygmalion (play)|Pygmalion]]'', who is taught to speak with an [[upper class]] [[accent (linguistics)|accent]].
==Overview==
It is sometimes inaccurately said that ELIZA simulates a therapist. Weizenbaum said that ELIZA provided a "[[parody]]" of "the responses of a non-directional psychotherapist in an initial psychiatric interview." He chose the context of psychotherapy to "sidestep the problem of giving the program a data base of real-world knowledge", the therapeutic situation being one of the few real human situations in which a human being can reply to a statement with a question that indicates very little specific knowledge of the topic under discussion. For example, it is a context in which the question "Who is your favorite composer?" can be answered acceptably with responses such as "What about your own favorite composer?" or "Does that question interest you?"
First implemented in Weizenbaum's own [[SLIP (programming language)|SLIP]] list-processing language, ELIZA worked by simple [[parsing]] and substitution of key words into canned phrases. Depending upon the initial entries by the user, the illusion of a human writer could be instantly dispelled, or could continue through several interchanges. It was sometimes so convincing that there are many anecdotes about people becoming very emotionally caught up in dealing with ELIZA for several minutes until the machine's true lack of understanding became apparent. This was likely due to people's tendency to attach meanings to words which the computer never put there.
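The keyword-and-substitution approach can be sketched as follows. This is a minimal illustration in Python, not Weizenbaum's SLIP implementation: the rules, the fallback phrase and the function name <code>respond</code> are hypothetical, chosen only to reproduce the two example exchanges quoted in the introduction.

```python
import re

# Hypothetical rules for illustration only; they are not taken from
# Weizenbaum's original script. Each rule pairs a keyword pattern with a
# canned phrase into which part of the user's input is substituted.
RULES = [
    (re.compile(r"\bmy (mother|father|sister|brother) (\w+) me\b", re.I),
     "Who else in your family {1} you?"),
    (re.compile(r"\bmy (.+)", re.I),
     "Why do you say your {0}?"),
]

# Fallback used when no keyword matches (also an assumption).
DEFAULT_REPLY = "Please go on."

def respond(statement):
    """Return a canned reply built from the first matching rule."""
    text = statement.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.search(text)
        if match:
            # Substitute the captured words into the canned phrase.
            return template.format(*match.groups())
    return DEFAULT_REPLY

print(respond("My head hurts"))
print(respond("My mother hates me"))
print(respond("It is raining."))
```

The rules are tried in order, so the more specific family pattern is listed before the general one. Nothing in the sketch models meaning; the reply is assembled purely from the words the user typed.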
In 1966, interactive computing (via a teletype) was new. It was 15 years before the personal computer became familiar to the general public, and two decades before most people encountered attempts at [[natural language processing]] in Internet services like [[Ask.com]] or PC help systems such as Microsoft Office [[Office Assistant|Clippy]]. Although those programs included years of research and work (while ''[[Ecala]]'' eclipsed the functionality of ''ELIZA'' after less than two weeks of work by a single programmer), ''ELIZA'' remains a milestone simply because it was the first time a programmer had attempted such a human-machine interaction with the goal of creating the illusion (however brief) of human-''human'' interaction.
In an excerpt from ''Computer Power and Human Reason'' (1976), reprinted in ''The New Media Reader'', edited by Noah Wardrip-Fruin and Nick Montfort, Weizenbaum describes how quickly and deeply people became emotionally involved with the computer program: some took offence when he asked to view the transcripts, saying it was an invasion of their privacy, and some even asked him to leave the room while they were working with ELIZA.
==Influence on games==
ELIZA influenced a number of early [[computer games]] by demonstrating additional kinds of [[interface design]]s. [[Don Daglow]] wrote an enhanced version of the program called ''Ecala'' on a [[PDP-10]] [[mainframe computer]] at [[Pomona College]] in [[1973]] before writing what was possibly the second or third computer [[role-playing game]], ''[[Dungeon (computer game)|Dungeon]]'' ([[1975]]). (The first was probably ''[[dnd (computer game)|dnd]]'', written on and for the PLATO system in 1974, and the second may have been [[Moria]], written in 1975.) It is likely that ''ELIZA'' was also on the system where [[Will Crowther]] created ''[[Colossal Cave Adventure|Adventure]]'', the 1975 game that spawned the [[interactive fiction]] genre. However, both of these games appeared some nine years after the original ''ELIZA''.
==Response and legacy==
Lay responses to ELIZA disturbed Weizenbaum and motivated him to write his book ''Computer Power and Human Reason: From Judgment to Calculation'', in which he explains the limits of computers and argues that anthropomorphic views of computers diminish the human being, and indeed any form of life.
There are many programs based on ELIZA in different languages in addition to ''Ecala''. For example, in 1980, a company called "Don't Ask Software", founded by Randy Simon, created a version for the Apple II, Atari, and Commodore PCs, which verbally abused the user based on the user's input. In Spain, Jordi Perez developed the famous ZEBAL in 1993, written in [[Clipper programming language|Clipper]] for MS-DOS. Other versions adapted ELIZA around a religious theme, such as ones featuring Jesus (both serious and comedic) and another Apple II variant called ''I Am Buddha''. The 1980 game ''[[The Prisoner (computer game)|The Prisoner]]'' incorporated ELIZA-style interaction within its gameplay.
ELIZA has also inspired a [[podcast]] called "The Eliza Podcast", in which the host engages in self-analysis, prompted by a computer-generated voice that asks questions in the same style as the ELIZA program.
==Implementations==
* Using [[JavaScript]]: http://www.manifestation.com/neurotoys/eliza.php3
* Source code in [[Java (programming language)|Java]]: http://chayden.net/eliza/Eliza.html
* Another [[Java (programming language)|Java]]-implementation of ELIZA: http://www.wedesoft.demon.co.uk/eliza/
* Using [[C (programming language)|C]] on the [[TI-89]]: http://kaikostack.com/ti89_en.htm#eliza
* Using [[z80#The Z80 assembly language|z80 Assembly]] on the [[TI-83#TI-83 Plus|TI-83 Plus]]: http://www.ticalc.org/archives/files/fileinfo/354/35463.html
* A [[perl module]] [http://search.cpan.org/dist/Chatbot-Eliza/ Chatbot::Eliza] — [http://www.terrence.com/perl/eliza/eliza.cgi example implementation]
* Trans-Tex Software has released shareware versions for Classic Mac OS and Mac OS X: http://www.tex-edit.com/index.html#Eliza
* doctor.el (circa [[1985]]) in [[Emacs]].
* Source code in [[Tcl]]: [http://wiki.tcl.tk/9235 http://wiki.tcl.tk/9235]
* The [http://www.indyproject.org Indy] [[Delphi]] oriented TCP/IP components suite has an Eliza implementation as demo.
*[http://www.cs.bham.ac.uk/research/projects/cogaff/eliza Pop-11 Eliza] in the [[poplog]] system. Goes back to about 1976, when it was used for teaching AI at [[Sussex University]]. Now part of the free open source Poplog system.
* Source code in [[BASIC]]: http://www.atariarchives.org/bigcomputergames/showpage.php?page=22
* ECC-Eliza for Windows (actual program is for DOS, but unpacker is for Windows) (rename .txt to .exe before running): http://www5.domaindlx.com/ecceliza1/ecceliza.txt. More recent version at http://web.archive.org/web/20041117123025/http://www5.domaindlx.com/ecceliza1/ecceliza.txt.
English language
'''English''' is an [[Indo-European languages|Indo-European]], [[West Germanic languages|West Germanic language]] originating in [[England]], and is the [[first language]] for most people in the [[United Kingdom]], the [[United States]], [[Canada]], [[Australia]], [[New Zealand]], [[Republic of Ireland|Ireland]], and the [[Anglophone Caribbean]]. It is used extensively as a [[second language]] and as an [[official language]] throughout the world, especially in [[Commonwealth of Nations|Commonwealth]] countries and in many [[international organization]]s.
==Significance==
Modern English, sometimes described as the first global [[lingua franca]], is the [[Linguistic imperialism|dominant]] [[international auxiliary language|international language]] in [[communication]]s, [[science]], [[business]], [[aviation]], [[entertainment]], [[radio]] and [[diplomacy]]. The initial reason for its enormous spread beyond the bounds of the [[British Isles]] where it was originally a native tongue was the [[British Empire]], and by the late nineteenth century its influence had won a truly global reach. It is the dominant language in the [[United States]] and the growing economic and cultural influence of that [[federal union]] as a global [[superpower]] since [[World War II]] has significantly accelerated adoption of English as a language across the planet.
A working knowledge of English has become a requirement in a number of fields, occupations and professions, such as medicine, and as a consequence over a billion people speak English to at least a basic level (see [[English language learning and teaching]]).
Linguists such as [[David Crystal]] recognize that one impact of this massive growth of English, in common with other global languages, has been to reduce native [[Natural language#Linguistic diversity|linguistic diversity]] in many parts of the world historically, most particularly in [[Australasia]] and [[North America]], and its huge influence continues to play an important role in [[language attrition]]. By a similar token, [[historical linguistics|historical linguists]], aware of the complex and fluid dynamics of [[language change]], are always alive to the potential English contains through the vast size and spread of the communities that use it and its natural internal variety, such as in its [[English-based creole languages|creoles]] and [[pidgin]]s, to produce a new [[language family|family]] of distinct languages over time.
English is one of six official languages of the [[United Nations]].
==History==
English is a [[West Germanic languages|West Germanic]] language that originated from the [[Anglo-Frisian languages|Anglo-Frisian]] dialects brought to [[Great Britain|Britain]] by Germanic settlers and Roman auxiliary troops from various parts of what is now northwest Germany and the Northern [[Netherlands]]. Initially, [[Old English language|Old English]] was a diverse group of dialects, reflecting the varied origins of the Anglo-Saxon Kingdoms of [[England]]. One of these dialects, Late West Saxon, eventually came to dominate. The original Old English language was then influenced by two waves of invasion. The first was by language speakers of the [[North Germanic languages|Scandinavian]] branch of the Germanic family; they conquered and colonized parts of Britain in the 8th and 9th centuries. The second was the [[Normans]] in the 11th century, who spoke Old Norman and ultimately developed an English variety of this called [[Anglo-Norman]]. These two invasions caused English to become "mixed" to some degree (though it was never a truly mixed language in the strict linguistic sense of the word; mixed languages arise from the cohabitation of speakers of different languages, who develop a hybrid tongue for basic communication).
Cohabitation with the Scandinavians resulted in a significant grammatical simplification and lexical supplementation of the Anglo-Frisian core of English; the later [[Normans|Norman]] occupation led to the grafting onto that Germanic core of a more elaborate layer of words from the [[Italic languages|Italic]] branch of the European languages. This Norman influence entered English largely through the courts and government. Thus, English developed into a "borrowing" language of great flexibility and with a huge vocabulary.
== Classification and related languages ==
The English language belongs to the western sub-branch of the [[Germanic languages|Germanic branch]] of the [[Indo-European languages|Indo-European]] family of languages. The closest living relative of English is [[Scots language|Scots]], spoken primarily in Scotland and parts of Northern Ireland, which is viewed by linguists as either a separate language or a group of dialects of English. The next closest relative to English after Scots is [[Frisian languages|Frisian]], spoken in the Northern Netherlands and Northwest Germany. Other less closely related living [[West Germanic languages]] include [[Dutch language|Dutch]], [[Low German]], [[German language|German]] and [[Afrikaans]]. The [[North Germanic languages]] of Scandinavia are less closely related to English than the West Germanic languages.
Many [[French language|French]] words are also intelligible to an English speaker (though pronunciations are often quite different) because English absorbed a large vocabulary from [[Norman language|Norman]] and French, via [[Anglo-Norman]] after the Norman Conquest and directly from French in subsequent centuries. As a result, a large portion of English vocabulary is derived from French, with some minor spelling differences (word endings, use of old French spellings, etc.), as well as occasional divergences in meaning, in so-called "faux amis", or [[false friend]]s. The pronunciation of French loanwords in English has become completely anglicized and follows a typically Germanic pattern of stress.
== Geographical distribution ==
Approximately 375 million people speak English as their first language. English today is probably the third largest language by number of native speakers, after [[Mandarin (linguistics)|Mandarin Chinese]] and [[Spanish language|Spanish]]. However, when combining native and non-native speakers it is probably the most commonly spoken language in the world, though possibly second to a combination of the [[Chinese language]]s, depending on whether or not distinctions in the latter are classified as "languages" or "dialects." Estimates that include [[second language]] speakers vary greatly from 470 million to over a billion depending on how [[literacy]] or mastery is defined. There are some who claim that non-native speakers now outnumber native speakers by a ratio of 3 to 1.
The countries with the highest populations of native English speakers are, in descending order: United States (215 million), United Kingdom (58 million), Canada (18.2 million), Australia (15.5 million), [[Republic of Ireland|Ireland]] (3.8 million), South Africa (3.7 million), and New Zealand (3.0-3.7 million). Countries such as [[Jamaica]] and [[Nigeria]] also have millions of native speakers of [[dialect continuum|dialect continua]] ranging from an [[English-based creole languages|English-based creole]] to a more standard version of English. Of those nations where English is spoken as a second language, India has the most such speakers ('[[Indian English]]') and linguistics professor [[David Crystal]] claims that, combining native and non-native speakers, India now has more people who speak or understand English than any other country in the world. Following India is the [[People's Republic of China]].
===Countries in order of total speakers===
English is the primary language in [[Anguilla]], [[Antigua and Barbuda]], Australia ([[Australian English]]), the [[The Bahamas|Bahamas]], [[Barbados]], [[Bermuda]], [[Belize]] ([[Belizean Kriol language|Belizean Kriol]]), the [[British Indian Ocean Territory]], the [[British Virgin Islands]], Canada ([[Canadian English]]), the [[Cayman Islands]], the [[Falkland Islands]], [[Gibraltar]], [[Grenada]], [[Guam]], [[Guernsey]] ([[Channel Island English]]), [[Guyana]], Ireland ([[Hiberno-English]]), [[Isle of Man]] ([[Manx English]]), Jamaica ([[Jamaican English]]), [[Jersey]], [[Montserrat]], [[Nauru]], New Zealand ([[New Zealand English]]), [[Pitcairn Islands]], [[Saint Helena]], [[Saint Kitts and Nevis]], [[Saint Vincent and the Grenadines]], [[Singapore]], [[South Georgia and the South Sandwich Islands]], [[Trinidad and Tobago]], the [[Turks and Caicos Islands]], the United Kingdom, the [[United States Virgin Islands|U.S. Virgin Islands]], and the United States.
In many other countries, where English is not the most spoken language, it is an official language; these countries include [[Botswana]], [[Cameroon]], [[Dominica]], [[Fiji]], the [[Federated States of Micronesia]], [[Ghana]], [[The Gambia|Gambia]], [[India]], [[Kenya]], [[Kiribati]], [[Lesotho]], [[Liberia]], [[Madagascar]], [[Malta]], the [[Marshall Islands]], [[Mauritius]], [[Namibia]], [[Nigeria]], [[Pakistan]], [[Palau]], [[Papua New Guinea]], the [[Philippines]], [[Puerto Rico]], [[Rwanda]], the [[Solomon Islands]], [[Saint Lucia]], [[Samoa]], [[Seychelles]], [[Sierra Leone]], [[Sri Lanka]], [[Swaziland]], [[Tanzania]], [[Uganda]], [[Zambia]], and [[Zimbabwe]]. It is also one of the 11 official languages that are given equal status in South Africa ([[South African English]]). English is also the official language in current [[dependent territory|dependent territories]] of Australia ([[Norfolk Island]], [[Christmas Island]] and [[Cocos Island]]) and of the United States ([[Northern Mariana Islands]], [[American Samoa]] and [[Puerto Rico]]), and in the former British colony of [[Hong Kong]].
English is an important language in several former [[colony|colonies]] and [[protectorate]]s of the United Kingdom where it falls short of official status, such as [[Malaysia]], [[Brunei]], [[United Arab Emirates]] and [[Bahrain]]. English is also not an official language in either the United States or the United Kingdom. Although the United States federal government has no official languages, English has been given official status by 30 of the 50 state governments.
===English as a global language===
Because English is so widely spoken, it has often been referred to as a "[[world language]]", the ''[[lingua franca]]'' of the modern era. While English is not an official language in most countries, it is currently the language most often taught as a [[second language]] around the world. Some linguists believe that it is no longer the exclusive cultural sign of "native English speakers", but is rather a language that is absorbing aspects of cultures worldwide as it continues to grow. It is, by international treaty, the official language for aerial and maritime communications. English is an official language of the [[United Nations]] and many other international organizations, including the [[International Olympic Committee]].
English is the language most often studied as a foreign language in the European Union (by 89% of schoolchildren), followed by French (32%), German (18%), and Spanish (8%). In the EU, a large fraction of the population reports being able to converse to some extent in English. Among non-English speaking countries, a large percentage of the population claimed to be able to converse in English in the [[Netherlands]] (87%), [[Sweden]] (85%), [[Denmark]] (83%), [[Luxembourg]] (66%), [[Finland]] (60%), [[Slovenia]] (56%), [[Austria]] (53%), [[Belgium]] (52%), and [[Germany]] (51%). [[Norway]] and [[Iceland]] also have a large majority of competent English-speakers.
[[Book]]s, [[magazine]]s, and [[newspaper]]s written in English are available in many countries around the world. English is also the most commonly used language in the [[science]]s. In 1997, the [[Science Citation Index]] reported that 95% of its articles were written in English, even though only half of them came from authors in English-speaking countries.
=== Dialects and regional varieties ===
The expansion of the British Empire and—since WWII—the primacy of the United States have spread English throughout the globe. Because of that global spread, English has developed a host of [[List of dialects of the English language|English dialects]] and English-based [[creole language]]s and [[pidgin]]s.
The major [[Variety (linguistics)|varieties]] of English include, in most cases, several subvarieties, such as [[Cockney]] within [[British English]]; [[Newfoundland English]] within [[Canadian English]]; and [[African American Vernacular English]] ("Ebonics") and [[Southern American English]] within [[American English]]. English is a [[pluricentric language]], without a central language authority like France's [[Académie française]]; and, although no variety is clearly considered the only standard, there are a number of accents considered to be more prestigious, such as [[Received Pronunciation]] in Britain. [[Scots language|Scots]] developed—largely independently—from the same origins, but following the [[Acts of Union 1707]] a process of [[language attrition]] began, whereby successive generations adopted more and more features from English causing dialectalisation. Whether it is now a separate language or a [[dialect]] of English better described as [[Scottish English]] is in dispute. The pronunciation, grammar and lexis of the traditional forms differ, sometimes substantially, from other varieties of English.
Because of the wide use of English as a second language, English speakers have many different [[Accent (linguistics)|accents]], which often signal the speaker's native dialect or language. For the more distinctive characteristics of regional accents, see [[Regional accents of English]], and for the more distinctive characteristics of regional dialects, see [[List of dialects of the English language]].
Just as English itself has borrowed words from many different languages over its history, English [[loanword]]s now appear in a great many languages around the world, indicative of the technological and cultural influence of its speakers. Several [[pidgin]]s and [[creole language]]s have formed using an English base, such as [[Jamaican (language)|Jamaican Patois]], [[Nigerian Pidgin]], and [[Tok Pisin]]. There are many words in English coined to describe forms of particular non-English languages that contain a very high proportion of English words. [[Franglais]], for example, is used to describe French with a very high English word content; it is found on the [[Channel Islands]]. Another variant, spoken in the border bilingual regions of Québec in Canada, is called [[Franglais#Frenglish|Frenglish]]. In [[Wales]], which is part of the United Kingdom, the languages of [[Welsh language|Welsh]] and English are sometimes mixed together by fluent or comfortable Welsh speakers, the result of which is called [[Welsh English|Wenglish]].
=== Constructed varieties of English ===
* [[Basic English]] is simplified for easy international use. It is used by manufacturers and other international businesses to write manuals and communicate. Some English schools in Asia teach it as a practical subset of English for use by beginners.
* [[Special English]] is a simplified version of English used by the [[Voice of America]]. It uses a vocabulary of only 1500 words.
* [[English spelling reform|English reform]] is an attempt to improve collectively upon the English language.
* [[Seaspeak]] and the related [[NATO phonetic alphabet|Airspeak]] and Policespeak, all based on restricted vocabularies, were designed by [[Edward Johnson]] in the 1980s to aid international cooperation and communication in specific areas. There is also a [[tunnelspeak]] for use in the [[Channel Tunnel]].
* [[Euro-English]] is a concept of standardising English for use as a second language in continental Europe.
* [[Manually Coded English]] — a variety of systems have been developed to represent the English language with hand signals, designed primarily for use in deaf education. These should not be confused with true sign languages such as [[British Sign Language]] and [[American Sign Language]] used in Anglophone countries, which are independent and not based on English.
* [[E-Prime]] excludes forms of the verb ''to be''.
Euro-English (also ''EuroEnglish'') terms are English translations of European concepts that are not native to English-speaking countries. Because of the United Kingdom's (and even the Republic of Ireland's) involvement in the European Union, the usage focuses on non-British concepts. This kind of Euro-English was parodied when English was "made" one of the constituent languages of [[Europanto]].
== Phonology ==
=== Vowels ===
'''Notes:'''
It is the [[vowel]]s that differ most from region to region.
Where symbols appear in pairs, the first corresponds to American English, [[General American]] accent; the second corresponds to British English, [[Received Pronunciation]].
# American English lacks this sound; words with this sound are pronounced with {{IPA | /ɑ/}} or {{IPA | /ɔ/}}. See [[Phonological history of English low back vowels#Lot-cloth split|''Lot-cloth split'']].
# Some dialects of North American English do not have this vowel. See [[phonological history of English low_back vowels#Cot-caught merger|''Cot-caught merger'']].
# The North American variation of this sound is a [[r-colored vowel|rhotic vowel]].
# Many speakers of North American English do not distinguish between these two unstressed vowels. For them, ''roses'' and ''Rosa's'' are pronounced the same, and the symbol usually used is [[schwa]] {{IPA | /ə/}}.
# This sound is often transcribed with {{IPA | /i/}} or with {{IPA | /ɪ/}}.
# The diphthongs {{IPA | /eɪ/}} and {{IPA | /oʊ/}} are monophthongal for many General American speakers, as {{IPA | /eː/}} and {{IPA | /oː/}}.
# The letter <''U''> can represent either {{IPA|/u/}} or the [[iotation|iotated]] vowel {{IPA|/ju/}}. In BRP, if this iotated vowel {{IPA|/ju/}} occurs after {{IPA|/t/}}, {{IPA|/d/}}, {{IPA|/s/}} or {{IPA|/z/}}, it often triggers palatalization of the preceding consonant, turning it to {{IPA|/ʨ/}}, {{IPA|/ʥ/}}, {{IPA|/ɕ/}} and {{IPA|/ʑ/}} respectively, as in ''tune'', ''during'', ''sugar'', and ''azure''. In American English, palatalization does not generally happen unless the {{IPA|/ju/}} is followed by ''r'', with the result that {{IPA|/(t, d,s, z)jur/}} turn to {{IPA|/tʃɚ/}}, {{IPA|/dʒɚ/}}, {{IPA|/ʃɚ/}} and {{IPA|/ʒɚ/}} respectively, as in ''nature'', ''verdure'', ''sure'', and ''treasure''.
# [[Vowel length]] plays a phonetic role in the majority of English dialects, and is said to be phonemic in a few dialects, such as [[Australian English]] and [[New Zealand English]]. In certain dialects of the modern English language, for instance [[General American]], there is allophonic vowel length: vowel phonemes are realized as long vowel allophones before voiced consonant phonemes in the coda of a syllable. Before the [[Great Vowel Shift]], vowel length was phonemically contrastive.
# This sound only occurs in non-rhotic accents. In some accents, this sound may be, instead of {{IPA|/ʊə/}}, {{IPA|/ɔ:/}}. See [[English-language vowel changes before historic r]].
# This sound only occurs in non-rhotic accents. In some accents, the schwa offglide of {{IPA|/ɛə/}} may be dropped, monophthongizing and lengthening the sound to {{IPA|/ɛ:/}}.
See also [[IPA chart for English dialects]] for more vowel charts.
=== Consonants ===
This is the English consonantal system using symbols from the [[International Phonetic Alphabet]] (IPA).
# The [[velar nasal]] {{IPA | [ŋ]}} is a non-phonemic allophone of /n/ in some northerly British accents, appearing only before /k/ and /g/. In all other dialects it is a separate phoneme, although it only occurs in [[syllable coda]]s.
# The [[alveolar tap]] {{IPA | [ɾ]}} is an allophone of /t/ and /d/ in unstressed syllables in [[North American English]] and [[Australian English]]. This is the sound of ''tt'' or ''dd'' in the words ''latter'' and ''ladder'', which are homophones for many speakers of North American English. In some accents such as [[Scottish English]] and [[Indian English]] it replaces {{IPA|/ɹ/}}. This is the same sound represented by single ''r'' in most varieties of [[Spanish language|Spanish]].
# In some dialects, such as [[Cockney]], the interdentals /θ/ and /ð/ are usually merged with /f/ and /v/, and in others, like [[African American Vernacular English]], /ð/ is merged with dental /d/. In some Irish varieties, /θ/ and /ð/ become the corresponding dental plosives, which then contrast with the usual alveolar plosives.
# The sounds {{IPA | /ʃ/, /ʒ/, and /ɹ/}} are labialised in some dialects. Labialisation is never contrastive in initial position and therefore is sometimes not transcribed. Most speakers of [[General American]] realize {{IPA|/ɹ/}} (always rhoticized) as the [[retroflex approximant]] {{IPA|/ɻ/}}, whereas the same sound is realized in [[Scottish English]], etc. as the [[alveolar trill]].
# The [[voiceless palatal fricative]] /ç/ is in most accents just an [[allophone]] of /h/ before /j/; for instance ''human'' /çjuːmən/. However, in some accents (see [[Phonological history of English consonant clusters|this]]), the /j/ is dropped, but the initial consonant is the same.
# The [[voiceless velar fricative]] /x/ is used by Scottish or Welsh speakers of English for Scots/Gaelic words such as ''loch'' {{IPA | /lɒx/}} or by some speakers for loanwords from German and Hebrew like ''Bach'' {{IPA|/bax/}} or ''Chanukah'' /xanuka/. /x/ is also used in South African English. In some dialects such as [[Scouse]] ([[Liverpool]]) either {{IPA|[x]}} or the [[affricate consonant|affricate]] {{IPA|[kx]}} may be used as an [[allophone]] of /k/ in words such as ''docker'' {{IPA | [dɒkxə]}}. Most native speakers have a great deal of trouble pronouncing it correctly when learning a foreign language. Most speakers use the sounds [k] and [h] instead.
# Voiceless w {{IPA | [ʍ]}} is found in Scottish and Irish English, as well as in some varieties of American, New Zealand, and English English. In most other dialects it is merged with /w/; in some dialects of Scots it is merged with /f/.
==== Voicing and aspiration ====
[[Voice (phonetics)|Voicing]] and [[aspiration (phonetics)|aspiration]] of [[stop consonant]]s in English depend on dialect and context, but a few general rules can be given:
* Voiceless [[stop consonant|plosives]] and [[affricate consonant|affricates]] (/{{IPA | p}}/, /{{IPA | t}}/, /{{IPA | k}}/, and /{{IPA | tʃ}}/) are aspirated when they are word-initial or begin a stressed syllable — compare ''pin'' {{IPA | [pʰɪn]}} and ''spin'' {{IPA | [spɪn]}}, ''crap'' {{IPA | [kʰɹ̥æp]}} and ''scrap'' {{IPA | [skɹæp]}}.
** In some dialects, aspiration extends to unstressed syllables as well.
** In other dialects, such as [[Indian English]], all voiceless stops remain unaspirated.
* Word-initial voiced plosives may be devoiced in some dialects.
* Word-terminal voiceless plosives may be unreleased or accompanied by a glottal stop in some dialects (e.g. many varieties of [[American English]]) — examples: ''tap'' [{{IPA |tʰæp̚}}], ''sack'' [{{IPA |sæk̚}}].
* Word-terminal voiced plosives may be devoiced in some dialects (e.g. some varieties of [[American English]]) — examples: ''sad'' [{{IPA |sæd̥}}], ''bag'' [{{IPA |bæɡ̊}}]. In other dialects they are fully voiced in final position, but only partially voiced in initial position.
=== Supra-segmental features ===
==== Tone groups ====
English is an [[Intonation (linguistics)|intonation language]]. This means that the [[pitch (music)|pitch]] of the [[human voice|voice]] is used [[Syntax|syntactically]], for example, to convey [[surprise (emotion)|surprise]] and [[irony]], or to change a [[sentence (linguistics)|statement]] into a [[question]].
In English, intonation patterns apply to groups of words, which are called tone groups, tone units, intonation groups or sense groups. Tone groups are said on a single breath and, as a consequence, are of limited length, averaging about five words or roughly two seconds. For example:
: -{{IPA | /duː juː niːd ˈɛnɪˌθɪŋ/}} ''Do you need anything?''
: -{{IPA | /aɪ dəʊnt | nəʊ/}} ''I don't, no''
: -{{IPA | /aɪ dəʊnt nəʊ/}} ''I don't know'' (contracted to, for example, -{{IPA | /aɪ dəʊnəʊ/}} or {{IPA | /aɪ dənəʊ/}} ''I dunno'' in fast or colloquial speech that de-emphasises the pause between don't and know even further)
==== Characteristics of intonation ====
English is a strongly stressed language, in that certain syllables, both within words and within phrases, get a relative prominence/loudness during pronunciation while the others do not. The former kind of syllables are said to be ''accentuated/stressed'' and the latter are ''unaccentuated/unstressed''. Dictionaries of English generally mark the accentuated syllable(s) by placing an apostrophe-like sign ( {{IPA | ˈ}} ) either before (as in [[International Phonetic Alphabet|IPA]], [[Oxford English Dictionary]], or [[Merriam-Webster]] dictionaries) or after (as in many other dictionaries) the syllable where the stress accent falls.
Hence in a sentence, each tone group can be subdivided into syllables, which can either be stressed (strong) or unstressed (weak). The most strongly stressed syllable is called the nuclear syllable. For example:
: ''That | was | the | '''best''' | thing | you | could | have | '''done'''!''
Here, all syllables are unstressed, except the syllables/words ''best'' and ''done'', which are stressed. ''Best'' is stressed harder and, therefore, is the nuclear syllable.
The nuclear syllable carries the main point the speaker wishes to make. For example:
: ''John'' had not stolen that money. (... Someone else had.)
: John ''had not'' stolen that money. (... Someone said he had. or ... Not at that time, but later he did.)
: John had not ''stolen'' that money. (... He acquired the money by some other means.)
: John had not stolen ''that'' money. (... He had stolen some other money.)
: John had not stolen that ''money''. (... He had stolen something else.)
Also:
: ''I'' did not tell her that. (... Someone else told her)
: I ''did not'' tell her that. (... You said I did. or ... but now I will)
: I did not ''tell'' her that. (... I did not say it; she could have inferred it, etc)
: I did not tell ''her'' that. (... I told someone else)
: I did not tell her ''that''. (... I told her something else)
This can also be used to express emotion:
: ''Oh'' really? (...I did not know that)
: Oh ''really''? (...I disbelieve you. or ... That's blatantly obvious)
The nuclear syllable is spoken more loudly than the others and has a characteristic '''change of pitch'''. The changes of pitch most commonly encountered in English are the '''rising pitch''' and the '''falling pitch''', although the '''fall-rising pitch''' and/or the '''rise-falling pitch''' are sometimes used. In this opposition between falling and rising pitch, which plays a larger role in English than in most other languages, falling pitch conveys certainty and rising pitch uncertainty. This can have a crucial impact on meaning, specifically in relation to polarity, the positive–negative opposition; thus, falling pitch means "polarity known", while rising pitch means "polarity unknown". This underlies the rising pitch of yes/no questions. For example:
: ''When do you want to be paid?''
: ''Now?'' (Rising pitch. In this case, it denotes a question: "Can I be paid now?" or "Do you desire to pay now?")
: ''Now.'' (Falling pitch. In this case, it denotes a statement: "I choose to be paid now.")
== Grammar ==
English grammar has minimal [[inflection]] compared with most other [[Indo-European languages]]. For example, Modern English, unlike Modern German or Dutch and the [[Romance languages]], lacks [[grammatical gender]] and [[Agreement (linguistics)|adjectival agreement]]. [[Grammatical case|Case]] marking has almost disappeared from the language and mainly survives in [[pronoun]]s. The patterning of [[Strong inflection|strong]] (e.g. ''speak/spoke/spoken'') versus [[Germanic weak verb|weak verbs]] inherited from its Germanic origins has declined in importance in modern English, and the remnants of inflection (such as [[plural]] marking) have become more regular.
At the same time, the language has become more [[Isolating language|analytic]], and has developed features such as [[modal verb]]s and [[word order]] as resources for conveying meaning. [[Auxiliary verb]]s mark constructions such as questions, negative polarity, the [[Grammatical voice|passive voice]] and progressive [[grammatical aspect|aspect]].
== Vocabulary ==
The English vocabulary has changed considerably over the centuries.
As in many languages deriving from [[Proto-Indo-European language|Proto-Indo-European]] (PIE), many of the most common words in English can be traced back (through the Germanic branch) to PIE. Such words include the basic pronouns ''I'', from [[Old English language|Old English]] ''ic'' (cf. Latin ''ego'', Greek ''ego'', Sanskrit ''aham''), ''me'' (cf. Latin ''me'', Greek ''eme'', Sanskrit ''mam''), numbers (e.g. ''one'', ''two'', ''three'', cf. Latin ''unus, duo, tres'', Greek ''oinos'' "ace (on dice)", ''duo, treis''), common family relationships such as mother, father, brother, sister etc. (cf. Greek ''meter'', Latin ''mater'', Sanskrit ''matṛ''; ''mother''), names of many animals (cf. Sanskrit ''mus'', Greek ''mys'', Latin ''mus''; ''mouse''), and many common verbs (cf. Greek ''gignōmi'', Latin ''gnoscere'', Hittite ''kanes''; ''to know'').
Germanic words (generally words of Old English or to a lesser extent Norse origin) tend to be shorter than the Latinate words of English, and more common in ordinary speech. This includes nearly all the basic pronouns, prepositions, conjunctions, modal verbs etc. that form the basis of English syntax and grammar. The longer Latinate words are often regarded as more elegant or educated. However, the excessive use of Latinate words is considered at times to be either pretentious or an attempt to [[obfuscation|obfuscate]] an issue. [[George Orwell]]'s [[essay]] "[[Politics and the English Language]]" is critical of this, as well as other perceived misuse of the language.
An English speaker is in many cases able to choose between Germanic and Latinate [[synonym]]s: ''come'' or ''arrive''; ''sight'' or ''vision''; ''freedom'' or ''liberty''. In some cases there is a choice between a Germanic derived word (''oversee''), a Latin derived word (''supervise''), and a French word derived from the same Latin word (''survey''). Such synonyms harbor a variety of different meanings and nuances, enabling the speaker to express fine variations or shades of thought. Familiarity with the [[etymology]] of groups of synonyms can give English speakers greater control over their [[Register (sociolinguistics)|linguistic register]]. See: [[List of Germanic and Latinate equivalents in English]].
An exception to this and a peculiarity perhaps unique to English is that the nouns for meats are commonly different from, and unrelated to, those for the animals from which they are produced, the animal commonly having a Germanic name and the meat having a French-derived one. Examples include: ''[[deer]]'' and ''[[venison]]''; ''[[cattle|cow]]'' and ''[[beef]]''; ''swine''/''[[pig]]'' and ''[[pork]]'', or ''[[domestic sheep|sheep]]'' and ''[[lamb and mutton|mutton]]''. This is assumed to be a result of the aftermath of the Norman invasion, where a French-speaking elite were the consumers of the meat, produced by Anglo-Saxon lower classes.
Since the majority of words used in informal settings will normally be Germanic, such words are often the preferred choices when a speaker wishes to make a point in an argument in a very direct way. A majority of Latinate words (or at least a majority of content words) will normally be used in more formal speech and writing, such as a [[court]]room or an [[encyclopedia]] article. However, there are other Latinate words that are used normally in everyday speech and do not sound formal; these are mainly words for concepts that no longer have Germanic words, and are generally assimilated better and in many cases do not appear Latinate. For instance, the words ''mountain'', ''valley'', ''river'', ''aunt'', ''uncle'', ''move'', ''use'', ''push'' and ''stay'' are all Latinate.
English easily accepts technical terms into common usage and often imports new words and phrases. Examples of this phenomenon include: ''[[HTTP cookie|cookie]]'', ''[[Internet]]'' and ''[[Uniform Resource Locator|URL]]'' (technical terms), as well as ''[[genre]]'', ''[[über]]'', ''[[lingua franca]]'' and ''amigo'' (imported words/phrases from French, German, modern Latin, and Spanish, respectively). In addition, [[slang]] often provides new meanings for old words and phrases. In fact, this fluidity is so pronounced that a distinction often needs to be made between formal forms of English and contemporary usage.
See also: [[sociolinguistics]].
=== Number of words in English ===
The ''General Explanations'' at the beginning of the ''Oxford English Dictionary'' states:
The vocabulary of English is undoubtedly vast, but assigning a specific number to its size is more a matter of definition than of calculation. Unlike languages such as [[Académie française|French]], [[List of language regulators|German]], [[Real Academia Española|Spanish]] and [[Accademia della Crusca|Italian]], English has no [[List of language regulators|Academy]] to define officially accepted words and spellings. [[Neologism]]s are coined regularly in medicine, science and technology and other fields, and new [[slang]] is constantly developed. Some of these new words enter wide usage; others remain restricted to small circles. Foreign words used in immigrant communities often make their way into wider English usage. Archaic, dialectal, and regional words might or might not be widely considered as "English".
The ''[[Oxford English Dictionary]],'' 2nd edition ''(OED2)'' includes over 600,000 definitions, following a rather inclusive policy:
The editors of ''[[Webster's Dictionary|Webster's Third New International Dictionary, Unabridged]]'' (475,000 main headwords) in their preface, estimate the number to be much higher. It is estimated that about 25,000 words are added to the language each year.
=== Word origins ===
One of the consequences of the French influence is that the vocabulary of English is, to a certain extent, divided between those words which are [[Germanic languages|Germanic]] (mostly West Germanic, with a smaller influence from the North Germanic branch) and those which are "Latinate" (Latin-derived, either directly or from Norman French or other Romance languages).
Numerous sets of statistics have been proposed to demonstrate the origins of English vocabulary. None, as yet, is considered definitive by most linguists.
A computerised survey of about 80,000 words in the old ''Shorter Oxford Dictionary'' (3rd ed.) was published in ''Ordered Profusion'' by Thomas Finkenstaedt and Dieter Wolff (1973) that estimated the origin of English words as follows:
*''[[Langues d'oïl|Langue d'oïl]]'', including French and [[Old Norman]]: [[List of English words of French origin|28.3%]]
*Latin, including modern scientific and technical Latin: 28.24%
*Other [[Germanic languages]] (including words directly inherited from [[Old English language|Old English]]): 25%
*Greek: 5.32%
*No etymology given: 4.03%
*Derived from proper names: 3.28%
*All other languages contributed less than 1%
A survey by [[Joseph M. Williams]] in ''Origins of the English Language'' of 10,000 words taken from several thousand business letters gave this set of statistics:
*French (langue d'oïl): 41%
*"Native" English: 33%
*Latin: 15%
*Danish: 2%
*Dutch: 1%
*Other: 10%
However, 83% of the 1,000 most common English words, and all of the 100 most common, are Germanic.
==== Dutch origins ====
Words describing the navy, types of ships, and other objects or activities on the water are often of Dutch origin. ''Yacht'' (''jacht'') and ''cruiser'' (''kruiser'') are examples.
==== French origins ====
There are many [[List of English words of French origin|words of French origin in English]], such as ''competition'', ''art'', ''table'', ''publicity'', ''police'', ''role'', ''routine'', ''machine'', ''force'', and many others that have been and are being [[anglicisation|anglicised]]; they are now pronounced according to English rules of [[phonology]], rather than French. A large portion of English vocabulary is of French or [[Langues d'oïl]] origin, most derived from, or transmitted via, the [[Anglo-Norman language|Anglo-Norman]] spoken by the [[upper class]]es in [[England]] for several hundred years after the [[Norman conquest of England]].
== Writing system ==
English has been written using the [[Latin alphabet]] since around the ninth century. (Before that, Old English had been written using [[Anglo-Saxon runes]].) The spelling system, or [[orthography]], is multilayered, with elements of French, Latin and Greek spelling on top of the native Germanic system; it has grown to vary significantly from the [[phonology]] of the language. The spelling of words often diverges considerably from how they are spoken.
Though letters and sounds may not correspond in isolation, spelling rules that take into account syllable structure, phonetics, and accents are 75% or more reliable. Some phonics spelling advocates claim that English is more than 80% phonetic.
In general, [[history of the English language|the English language]], being the product of many other languages and having only been codified orthographically in the 16th century, has fewer consistent relationships between sounds and letters than many other languages. The consequence of this orthographic history is that reading can be challenging. It takes longer for students to become completely fluent readers of English than of many other languages, including French, Greek, and Spanish.
=== Basic sound-letter correspondence ===
Only the consonant letters are pronounced in a relatively regular way:
=== Written accents ===
Unlike most other Germanic languages, English has almost no [[diacritic]]s except in foreign [[loanword]]s (like the [[acute accent]] in ''café''), and in the uncommon use of a [[diaeresis]] mark (often in formal writing) to indicate that two vowels are pronounced separately, rather than as one sound (e.g. ''naïve, Zoë''). It is almost always acceptable to leave out the marks, especially in digital communications where the [[QWERTY]] keyboard lacks any marked letters, but it depends on the context where the word is used.
Some English words retain the diacritic to distinguish them from others, such as ''[[Animé (oleo-resin)|animé]], [[Investigative journalism|exposé]], [[Lamé (fencing)|lamé]], [[öre]], [[øre]], [[pâté]], [[piqué]],'' and ''[[rosé]]'', though these are sometimes also dropped (''[[résumé]]/resumé'' is usually spelled ''resume'' in the United States). There are loan words which occasionally use a diacritic to represent their pronunciation that is not in the original word, such as ''maté'', from Spanish ''[[yerba mate]]'', following the French usage, but they are extremely rare.
== Formal written English ==
A version of the language almost universally agreed upon by educated English speakers around the world is called [[formal written English]]. It takes virtually the same form no matter where in the English-speaking world it is written. In spoken English, by contrast, there are a vast number of differences between [[dialect]]s, [[Accent (linguistics)|accents]], and varieties of [[slang]], colloquial and regional expressions. In spite of this, local variations in the formal written version of the language are quite limited, being restricted largely to the [[American and British English spelling differences|spelling differences between British and American English]].
== Basic and simplified versions ==
To make English easier to read, there are some simplified versions of the language. One basic version is named ''[[Basic English]]'', a [[constructed language]] with a small number of words created by [[Charles Kay Ogden]] and described in his book ''Basic English: A General Introduction with Rules and Grammar'' (1930). The language is based on a simplified version of English. Ogden said that it would take seven years to learn English, seven months for [[Esperanto]], and seven weeks for Basic English, comparable with [[Ido]]. Thus Basic English is used by companies who need to make complex books for international use, and by language schools that need to give people some knowledge of English in a short time.
Ogden did not put any words into Basic English that could be said with a few other words and he worked to make the words work for speakers of any other language. He put his set of words through a large number of tests and adjustments. He also made the grammar simpler, but tried to keep the grammar normal for English users.
The concept gained its greatest publicity just after the [[World War II|Second World War]] as a tool for world peace. Although it was not built into a program, similar simplifications were devised for various international uses.
Another version, [[Simplified English]], exists, which is a [[Controlled natural language|controlled language]] originally developed for [[aerospace]] industry maintenance manuals. It offers a carefully limited and standardised subset of English. Simplified English has a lexicon of approved words and those words can only be used in certain ways. For example, the word ''close'' can be used in the phrase "Close the door" but not "do not go close to the landing gear".
Esperanto
'''Esperanto''' is by far the most widely spoken [[constructed language|constructed]] [[international auxiliary language]] in the world.
Its name derives from ''Doktoro Esperanto,'' the [[pseudonym]] under which [[L. L. Zamenhof]] published the first book detailing Esperanto, the ''[[Unua Libro]],'' in 1887. The word ''esperanto'' means 'one who hopes' in the language itself. Zamenhof's goal was to create an easy and flexible language that would serve as a universal [[second language]] to foster peace and international understanding.
Esperanto has had continuous usage by a community estimated at between 100,000 and 2 million speakers for over a century. By most estimates, there are approximately one thousand [[Native Esperanto speakers|native speakers]].
However, no country has adopted the language [[official language|officially]]. Today, Esperanto is employed in world travel, correspondence, cultural exchange, conventions, literature, language instruction, television, and radio broadcasting. There is also an [[Esperanto Wikipedia]], which contained over 100,000 articles as of June 2008.
There is evidence that [[Propaedeutic value of Esperanto|learning Esperanto may provide a good foundation for learning languages in general]]. Some state education systems offer basic instruction and elective courses in Esperanto. Esperanto is also the language of instruction in one university, the [[Akademio Internacia de la Sciencoj San Marino|Akademio Internacia de la Sciencoj]] in [[San Marino]].
== History ==
Esperanto was developed in the late 1870s and early 1880s by [[ophthalmology|ophthalmologist]] [[L. L. Zamenhof|Dr. Ludovic Lazarus Zamenhof]], an [[Ashkenazi Jew]] from [[Bialystok]], now in [[Poland]] and previously in the [[Polish-Lithuanian Commonwealth]], but at the time part of the [[Russian Empire]].
After some ten years of development, which Zamenhof spent translating literature into the language as well as writing original [[prose]] and [[Poetry|verse]], the [[Unua Libro|first book of Esperanto grammar]] was published in [[Warsaw]] in July 1887. The number of speakers grew rapidly over the next few decades, at first primarily in the [[Russian Empire]] and [[Eastern Europe]], then in [[Western Europe]], the [[Americas]], [[China]], and [[Japan]]. In the early years, speakers of Esperanto kept in contact primarily through correspondence and [[magazine|periodicals]], but in 1905 the first [[World Congress of Esperanto|world congress of Esperanto speakers]] was held in [[Boulogne-sur-Mer]], [[France]]. Since then world congresses have been held in different countries every year, except during the two [[world war|World Wars]]. Since the Second World War, they have been attended by an average of over 2,000 and up to 6,000 people.
===Relation to 20th-century totalitarianism===
As a potential vehicle for international understanding, Esperanto attracted the suspicion of many [[totalitarian]] states. The situation was especially pronounced in [[Nazi Germany]] and in the [[Soviet Union]] under [[Joseph Stalin]].
In Germany, there was additional motivation to persecute Esperanto because Zamenhof was a Jew. In his work ''[[Mein Kampf]],'' [[Hitler]] mentioned Esperanto as an example of a language that would be used by an [[International Jewry|International]] [[Jewish conspiracy|Jewish Conspiracy]] once they achieved [[world domination]]. [[Esperantist]]s were executed during [[the Holocaust]], with Zamenhof's family in particular singled out for execution.
In the early years of the Soviet Union, Esperanto was given a measure of government support, and an officially recognized Soviet Esperanto Association came into being. However, in 1937, Stalin reversed this policy. He denounced Esperanto as "the language of spies" and had Esperantists executed. The use of Esperanto remained illegal until 1956.
==Official use==
Esperanto has never been an official language of any recognized country. However, there were plans at the beginning of the 20th century to establish [[Moresnet|Neutral Moresnet]] as the world's first Esperanto state. In China, there was talk in some circles after the 1911 [[Xinhai Revolution]] about officially replacing [[Chinese language|Chinese]] with Esperanto as a means to dramatically bring the country into the twentieth century, though this policy proved untenable. In the summer of 1924, the [[American Radio Relay League]] adopted Esperanto as its official [[international auxiliary language]], and hoped that the language would be used by [[Amateur radio|radio amateurs]] in international communications, but its actual use for radio communications was negligible. In addition, the self-proclaimed [[artificial island]] [[micronation]] of [[Republic of Rose Island|Rose Island]] used Esperanto as its official language in 1968. Esperanto is the working language of several [[non-profit organization|non-profit]] international organizations such as the ''[[Sennacieca Asocio Tutmonda]]'', but most others are specifically Esperanto organizations. The largest of these, the [[World Esperanto Association]], has an official consultative relationship with the [[United Nations]] and [[UNESCO]]. The U.S. Army has published military phrasebooks in Esperanto, to be used in [[Military simulation|wargames]] by mock enemy forces. Esperanto is also the first language of teaching and administration of the [[Akademio Internacia de la Sciencoj San Marino|International Academy of Sciences San Marino]], which is sometimes called an "Esperanto University".
== Linguistic properties ==
=== Classification ===
As a [[constructed language]], Esperanto is not [[Genealogy|genealogically]] related to any [[ethnic group|ethnic]] language. It has been described as "a language [[lexicon|lexically]] predominantly [[Romance languages|Romanic]], [[morphology (linguistics)|morphologically]] intensively [[agglutination|agglutinative]] and to a certain degree [[isolating languages|isolating]] in character". The [[phonology]], [[grammar]], [[vocabulary]], and [[semantics]] are based on the western [[Indo-European languages]]. The [[phoneme|phonemic inventory]] is essentially [[Slavic languages|Slavic]], as is much of the semantics, while the [[vocabulary]] derives primarily from the [[Romance languages]], with a lesser contribution from the [[Germanic languages]]. [[Pragmatics]] and other aspects of the language not specified by Zamenhof's original documents were influenced by the native languages of early speakers, primarily [[Russian language|Russian]], [[Polish language|Polish]], [[German language|German]], and [[French language|French]]. [[Linguistic typology|Typologically]], Esperanto has [[preposition]]s and a [[information flow|pragmatic word order]] that by default is ''[[Subject Verb Object]]'' and ''[[Word order|Adjective Noun]]''. New words are formed through extensive [[prefix (linguistics)|prefix]]ing and [[suffix]]ing.
=== Writing system ===
Esperanto is written with a modified version of the [[Latin alphabet]], including six [[Letter (alphabet)|letters]] with [[diacritic]]s: [[c-circumflex|ĉ]], [[g-circumflex|ĝ]], [[h-circumflex|ĥ]], [[j-circumflex|ĵ]], [[s-circumflex|ŝ]] and [[u-breve|ŭ]] (that is, ''c, g, h, j, s'' [[circumflex]], and ''u'' [[breve]]). The alphabet does not include the letters ''q, w, x,'' or ''y'' except in unassimilated foreign names.
The 28-letter alphabet is:
'''a b c ĉ d e f g ĝ h ĥ i j ĵ k l m n o p r s ŝ t u ŭ v z'''
All letters are pronounced approximately as in the [[IPA]], with the exception of ''c'' and the accented letters:
Two [[ASCII]]-compatible writing conventions are in use. These substitute [[Digraph (orthography)|digraph]]s for the accented letters. The original "h-convention" (''ch, gh, hh, jh, sh, u'') is based on English 'ch' and 'sh', while a more recent "[[x-convention]]" (''cx, gx, hx, jx, sx, ux'') is useful for alphabetic word sorting on a [[computer]] (''cx'' comes correctly after ''cu'', ''sx'' after ''sv'', etc.) as well as for simple conversion back into the standard [[orthography]].
Another scheme marks the accented letters with a [[caret]] (^) placed after or before the base letter, as for example: c^ or ^c.
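The conversion from the x-convention back to the standard orthography can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names <code>X_TO_UNICODE</code> and <code>from_x_convention</code> are hypothetical, and the sketch assumes that every ''cx, gx, hx, jx, sx, ux'' sequence in the input stands for an accented letter, which holds for native Esperanto words because ''x'' is not part of the alphabet.

```python
# Hypothetical mapping from x-convention digraphs to the accented letters.
X_TO_UNICODE = {
    "cx": "ĉ", "gx": "ĝ", "hx": "ĥ",
    "jx": "ĵ", "sx": "ŝ", "ux": "ŭ",
}


def from_x_convention(text: str) -> str:
    """Replace each x-convention digraph with its accented letter."""
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        letter = X_TO_UNICODE.get(pair.lower())
        if letter is not None:
            # Keep the capitalisation of the base letter (Cx -> Ĉ).
            out.append(letter.upper() if pair[0].isupper() else letter)
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


print(from_x_convention("sxranko"))  # ŝranko
print(from_x_convention("Cxu"))      # Ĉu
print(sorted(["cx", "cu"]))          # ['cu', 'cx']
```

The last line illustrates the sorting property mentioned above: because ''x'' follows ''u'' in ASCII order, a plain string sort places ''cx'' after ''cu''.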
=== Phonology ===
:''(For help with the phonetic symbols, see [[Help:IPA]])''
Esperanto has 22 [[consonant]]s, 5 [[vowel]]s, and two [[semivowel]]s, which combine with the vowels to form 6 [[diphthong]]s. (The consonant {{IPA|/j/}} and semivowel {{IPA|/i̯/}} are both written ''j''.) [[tone (linguistics)|Tone]] is not used to distinguish meanings of words. [[Stress (linguistics)|Stress]] is always on the penultimate vowel, unless a final vowel ''o'' is [[Elision|elided]], a practice which occurs mostly in [[poetry]]. For example, ''familio'' "family" is stressed {{IPA2|fa.mi.ˈli.o}}, but when found without the final o, ''famili’,'' the stress does not shift: {{IPA|[fa.mi.ˈli]}}.
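The stress rule lends itself to a small illustration in Python. The function name <code>stressed_vowel_index</code> is hypothetical, and the sketch assumes that an elided final ''o'' is marked with an apostrophe and that words with a single vowel are stressed on that vowel. The letters ''j'' and ''ŭ'' are not counted, since they are not full vowels.

```python
VOWELS = set("aeiou")  # j and ŭ are deliberately excluded


def stressed_vowel_index(word: str) -> int:
    """Return the position in `word` of the vowel that carries the stress."""
    w = word.lower()
    positions = [i for i, ch in enumerate(w) if ch in VOWELS]
    if not positions:
        raise ValueError("word contains no vowel")
    # Assumption: elision of the final o is written with an apostrophe.
    elided = w.endswith(("'", "’"))
    if elided or len(positions) == 1:
        # The stress stays where it was, which is now the last vowel.
        return positions[-1]
    return positions[-2]


print(stressed_vowel_index("familio"))  # 5, the second i
print(stressed_vowel_index("famili’"))  # 5, the stress does not shift
```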
==== Consonants ====
The 22 consonants are:
The sound {{IPA|/r/}} is usually [[alveolar trill|rolled]], but may be [[alveolar flap|tapped]] {{IPA|[ɾ]}}. The {{IPA|/v/}} has a normative pronunciation like an [[English language|English]] ''v,'' but is sometimes somewhere between a ''v'' and a ''w,'' {{IPA|[ʋ]}}, depending on the language background of the speaker. A semivowel {{IPA|/u̯/}} normally occurs only in [[diphthong]]s after the vowels {{IPA|/a/}} and {{IPA|/e/}}, not as a consonant {{IPA|*/w/}}. Common, if debated, [[assimilation (linguistics)|assimilation]] includes the pronunciation of {{IPA|/nk/}} as {{IPA|[ŋk]}}, as in English ''sink,'' and {{IPA|/kz/}} as {{IPA|[gz]}}, like the ''x'' in English ''example''.
A large number of consonant clusters can occur, up to three in initial position and four in medial position, as in ''instrui'' "to teach". Final clusters are uncommon except in foreign names, poetic elision of final ''o,'' and a very few basic words such as ''cent'' "hundred" and ''post'' "after".
====Vowels====
Esperanto has the five [[cardinal vowels]] of [[Spanish language|Spanish]], [[Swahili language|Swahili]], and [[Modern Greek]].
There are six falling diphthongs: ''uj, oj, ej, aj, aŭ, eŭ'' ({{IPA|/ui̯, oi̯, ei̯, ai̯, au̯, eu̯/}}).
With only five vowels, a good deal of variation is tolerated. For instance, {{IPA|/e/}} commonly ranges from {{IPA|[e]}} (French ''é'') to {{IPA|[ɛ]}} (French ''è''). The details often depend on the speaker's native language. A [[glottal stop]] may occur between adjacent vowels in some people's speech, especially when the two vowels are the same, as in ''heroo'' "hero" ({{IPA|[he.ˈro.o]}} or {{IPA|[he.ˈro.ʔo]}}) and ''praavo'' "great-grandfather" ({{IPA|[pra.ˈa.vo]}} or {{IPA|[pra.ˈʔa.vo]}}).
=== Grammar ===
Esperanto words are [[Derivation (linguistics)|derived]] by stringing together [[prefix (linguistics)|prefix]]es, [[Root (linguistics)|roots]], and [[suffix]]es. This process is regular, so that people can create new words as they speak and be understood. [[Compound (linguistics)|Compound]] words are formed with a modifier-first, [[head (linguistics)|head-final]] order, the same order as English "birdsong" ''vs.'' "songbird".
The different [[Part of speech|parts of speech]] are marked by their own suffixes: all [[common noun]]s end in ''-o,'' all [[adjective]]s in ''-a,'' all derived adverbs in ''-e,'' and all [[verb]]s in one of six [[Grammatical tense|tense]] and [[Grammatical mood|mood]] suffixes, such as [[present tense]] ''-as.''
[[Grammatical number|Plural]] nouns end in ''-oj'' (pronounced "oy"), whereas [[direct object]]s end in ''-on.'' Plural direct objects end with the combination ''-ojn'' (pronounced to rhyme with "coin"): That is, ''-o'' for a noun, plus ''-j'' for plural, plus ''-n'' for direct object. Adjectives [[Grammatical number#Effect of number on verbs and other parts of speech|agree]] with their nouns; their endings are plural ''-aj'' (pronounced "eye"), direct-object ''-an,'' and plural direct-object ''-ajn'' (pronounced to rhyme with "fine").
The suffix ''-n'' is used to indicate the goal of movement and a few other things, in addition to the direct object. See [[Esperanto grammar]] for details.
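Because the noun and adjective endings simply stack in a fixed order, they can be modelled by plain string concatenation. The following Python sketch is illustrative only; the function <code>decline</code> and its parameters are hypothetical names, and it covers just the ''-o''/''-a'', ''-j'' and ''-n'' endings described above.

```python
def decline(stem: str, ending: str, plural: bool = False,
            accusative: bool = False) -> str:
    """Build a noun ("o") or adjective ("a") form from a bare stem."""
    if ending not in ("o", "a"):
        raise ValueError("ending must be 'o' (noun) or 'a' (adjective)")
    # Order is fixed: part-of-speech vowel, then -j, then -n.
    return stem + ending + ("j" if plural else "") + ("n" if accusative else "")


print(decline("libr", "o"))                                # libro
print(decline("libr", "o", plural=True))                   # libroj
print(decline("libr", "o", accusative=True))               # libron
print(decline("libr", "o", plural=True, accusative=True))  # librojn
print(decline("simpl", "a", plural=True, accusative=True)) # simplajn
```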
The six verb [[inflection]]s consist of three tenses and three moods. They are [[present tense]] ''-as,'' [[future tense]] ''-os,'' [[past tense]] ''-is,'' [[infinitive|infinitive mood]] ''-i,'' [[conditional mood]] ''-us,'' and [[jussive mood]] ''-u'' (used for wishes and commands). Verbs are not marked for person or number. For instance: ''kanti'' "to sing"; ''mi kantas'' "I sing"; ''mi kantis'' "I sang"; ''mi kantos'' "I will sing"; ''li kantas'' "he sings"; ''vi kantas'' "you sing".
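The regularity of the verb system can be shown with a short Python sketch. The names <code>VERB_ENDINGS</code> and <code>conjugate</code> are hypothetical; the sketch takes an infinitive in ''-i'', strips that ending, and attaches one of the six endings listed above. Since verbs are not marked for person or number, the subject pronoun plays no part in choosing the form.

```python
# The six endings described in the text, keyed by illustrative labels.
VERB_ENDINGS = {
    "present": "as",
    "future": "os",
    "past": "is",
    "infinitive": "i",
    "conditional": "us",
    "jussive": "u",
}


def conjugate(infinitive: str, form: str) -> str:
    """Replace the infinitive ending -i with the ending for `form`."""
    if not infinitive.endswith("i"):
        raise ValueError("expected an infinitive ending in -i")
    return infinitive[:-1] + VERB_ENDINGS[form]


for pronoun in ("mi", "li", "vi"):
    # The verb form is the same for every person.
    print(pronoun, conjugate("kanti", "present"))

print("mi", conjugate("kanti", "past"))    # mi kantis
print("mi", conjugate("kanti", "future"))  # mi kantos
```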
Word order is comparatively free: Adjectives may precede or follow nouns, and subjects, verbs and objects (marked by the suffix ''-n)'' may occur in any order. However, the [[article (grammar)|article]] ''la'' "the" and [[demonstrative]]s such as ''tiu'' "this, that" almost always come before the noun, and a [[preposition]] such as ''ĉe'' "at" ''must'' come before it. Similarly, the negative ''ne'' "not" and [[conjunction]]s such as ''kaj'' "both, and" and ''ke'' "that" must precede the [[phrase]] or [[clause]] they introduce. In [[copula]]r (A = B) clauses, word order is just as important as it is in English clauses like "people are dogs" ''vs.'' "dogs are people".
====Correlatives====
A [[correlative]] is a word used to ask or answer a question of who, where, what, when, or how. Correlatives in Esperanto are set out in a systematic manner that correlates a basic [[idea]] (quantity, manner, time, ''etc.'') to a function (questioning, indicating, negating, ''etc.'').
Examples:
*''Kio estas tio?'' "What is this?"
*''Kioma estas la horo?'' "What time is it?" Note ''kioma'' rather than ''Kiu estas la horo?'' "which is the hour?", when asking for the ranking order of the hour on the clock.
*''Io falis el la ŝranko'' "Something fell out of the cupboard."
*''Homoj tiaj kiel mi ne konadas timon.'' "Men such as me know no fear."
Correlatives are declined if the case demands it:
*''Vi devas elekti ian vorton pli simpla'' "You should choose a (some kind of) simpler word." ''Ia'' receives ''-n'' because it's part of the [[direct object]].
*''Kian libron vi volas?'' "What sort of book do you want?" Contrast this with, ''Kiun libron vi volas?'' "Which book do you want?"
=== Vocabulary ===
The core vocabulary of Esperanto was defined by ''Lingvo internacia'', published by Zamenhof in 1887. It comprised 900 roots, which could be expanded into tens of thousands of words with prefixes, suffixes, and compounding. In 1894, Zamenhof published the first Esperanto [[dictionary]], ''Universala Vortaro'', with a larger set of roots. However, the rules of the language allowed speakers to borrow new roots as needed, recommending only that they look for the most international forms, and then derive related meanings from these.
Since then, many words have been borrowed, primarily but not solely from the Western European languages. Not all proposed borrowings catch on, but many do, especially [[technical terminology|technical]] and [[science|scientific]] terms. Terms for everyday use, on the other hand, are more likely to be derived from existing roots—for example ''komputilo'' (a computer) from ''komputi'' (to compute) plus the suffix ''-ilo'' (tool)—or to be covered by extending the meanings of existing words (for example ''muso'' (a mouse), as in English, now also means a computer input device). There are frequent debates among Esperanto speakers about whether a particular borrowing is justified or whether the need can be met by deriving from or extending the meaning of existing words.
In addition to the root words and the rules for combining them, a learner of Esperanto must memorize some idiomatic compounds that are not entirely straightforward. For example, ''eldoni'', literally "to give out", is used for "to publish" (a [[calque]] of words in several European languages with the same derivation), and ''vortaro'', literally "a collection of words", means "a glossary" or "a dictionary". Such forms are modeled after usage in some European languages, and speakers of other languages may find them illogical. Fossilized derivations inherited from Esperanto's source languages may be similarly obscure, such as the opaque connection the root word ''centralo'' "power station" has with ''centro'' "center". Compounds with ''-um-'' are overtly arbitrary, and must be learned individually, as ''-um-'' has no defined meaning. It turns ''dekstren'' "to the right" into ''dekstrumen'' "clockwise", and ''komuna'' "common/shared" into ''komunumo'' "community", for example.
Nevertheless, there are not nearly as many idiomatic or [[slang]] words in Esperanto as in ethnic languages, as these tend to make international communication difficult, working against Esperanto's main goal.
===Useful phrases===
Here are some useful Esperanto phrases, with [[help:IPA|IPA]] transcriptions:
* Hello: ''Saluton'' {{IPA|/sa.ˈlu.ton/}}
* What is your name?: ''Kiel vi nomiĝas?'' {{IPA|/ˈki.el vi no.ˈmi.ʤas/}}
* My name is...: ''Mi nomiĝas...'' {{IPA|/mi no.ˈmi.ʤas/}}
* How much (is it/are they)?: ''Kiom (estas)?'' {{IPA|/ˈki.om ˈes.tas/}}
* Here you are: ''Jen'' {{IPA|/jen/}}
* Do you speak Esperanto?: ''Ĉu vi parolas Esperanton?'' {{IPA|/ˈʧu vi pa.ˈro.las es.pe.ˈran.ton/}}
* I do not understand you: ''Mi ne komprenas vin'' {{IPA|/mi ˈne kom.ˈpre.nas vin/}}
* I like ''this'' one: ''Ĉi tiu plaĉas al mi'' {{IPA|/ʧi ˈti.u ˈpla.ʧas al ˈmi/}} or ''Mi ŝatas tiun ĉi'' {{IPA|/mi ˈʃa.tas ˈti.un ˈʧi/}}
* Thank you: ''Dankon'' {{IPA|/ˈdan.kon/}}
* You're welcome: ''Ne dankinde'' {{IPA|/ˈne dan.ˈkin.de/}}
* Please: ''Bonvolu'' {{IPA|/bon.ˈvo.lu/}} or ''mi petas'' {{IPA|/mi ˈpe.tas/}}
* Here's to your health: ''Je via sano'' {{IPA|/je ˈvi.a ˈsa.no/}}
* Bless you!/Gesundheit!: ''Sanon!'' {{IPA|/ˈsa.non/}}
* Congratulations!: ''Gratulon!'' {{IPA|/ɡra.ˈtu.lon/}}
* Okay: ''Bone'' {{IPA|/ˈbo.ne/}} or ''Ĝuste'' {{IPA|/ˈʤus.te/}}
* Yes: ''Jes'' {{IPA|/ˈjes/}}
* No: ''Ne'' {{IPA|/ˈne/}}
* It is a nice day: ''Estas bela tago'' {{IPA|/ˈes.tas ˈbe.la ˈta.ɡo/}}
* I love you: ''Mi amas vin'' {{IPA|/mi ˈa.mas vin/}}
* Goodbye: ''Ĝis (la) (revido)'' {{IPA|/ʤis la re.ˈvi.do/}}
* One beer, please: ''Unu bieron, mi petas.'' {{IPA|/ˈu.nu bi.ˈe.ron, mi ˈpe.tas/}}
* What is that?: ''Kio estas tio?'' {{IPA|/ˈki.o ˈes.tas ˈti.o/}}
* That is...: ''Tio estas...'' {{IPA|/ˈti.o ˈes.tas/}}
* How are you?: ''Kiel vi (fartas)?'' {{IPA|/ˈki.el vi ˈfar.tas/}}
* Good morning!: ''Bonan matenon!'' {{IPA|/ˈbo.nan ma.ˈte.non/}}
* Good evening!: ''Bonan vesperon!'' {{IPA|/ˈbo.nan ves.ˈpe.ron/}}
* Good night!: ''Bonan nokton!'' {{IPA|/ˈbo.nan ˈnok.ton/}}
* Peace!: ''Pacon!'' {{IPA|/ˈpa.tson/}}
=== Sample text ===
The following short extract gives an idea of the character of Esperanto. (Pronunciation is covered above. The main point for English speakers to remember is that the letter 'J' has the sound of the letter 'Y' in English.)
* Esperanto text
:''En multaj lokoj de Ĉinio estis temploj de drako-reĝo. Dum trosekeco oni preĝis en la temploj, ke la drako-reĝo donu pluvon al la homa mondo. Tiam drako estis simbolo de la supernatura estaĵo. Kaj pli poste, ĝi fariĝis prapatro de la plej altaj regantoj kaj simbolis la absolutan aŭtoritaton de feŭda imperiestro. La imperiestro pretendis, ke li estas filo de la drako. Ĉiuj liaj vivbezonaĵoj portis la nomon drako kaj estis ornamitaj per diversaj drakofiguroj. Nun ĉie en Ĉinio videblas drako-ornamentaĵoj kaj cirkulas legendoj pri drakoj.''
* English translation
:In many places in China there were temples of the dragon king. During times of drought, people prayed in the temples, that the dragon king would give rain to the human world. At that time the dragon was a symbol of the supernatural. Later on, it became the ancestor of the highest rulers and symbolised the absolute authority of the feudal emperor. The emperor claimed to be the son of the dragon. All of his personal possessions carried the name ''dragon'' and were decorated with various dragon figures. Now everywhere in China dragon decorations can be seen and there circulate legends about dragons.
== Education ==
The majority of Esperanto speakers learn the language through self-directed study, online tutorials, and correspondence courses taught by volunteers. In more recent years, teaching websites like ''[[lernu!]]'' have become popular.
Esperanto instruction is occasionally available at schools, such as a [[Esperanto#Esperanto and language acquisition|pilot project involving four primary schools]] under the supervision of the [[University of Manchester]], and by one count at 69 universities. However, outside of [[China]] and [[Hungary]], these mostly involve informal arrangements rather than dedicated departments or state sponsorship. [[Eötvös Loránd University]] in Budapest had a department of Interlinguistics and Esperanto from 1966 to 2004, after which time instruction moved to vocational colleges; there are state examinations for Esperanto instructors.
Various educators have estimated that Esperanto can be learned in anywhere from one quarter to one twentieth the amount of time required for other languages. Some argue, however, that this is only true for native speakers of Western European languages. [[Claude Piron]], a psychologist formerly at the [[University of Geneva]] and Chinese-English-Russian-Spanish translator for the United Nations, argued that Esperanto is far more "brain friendly" than many ethnic languages. "Esperanto relies entirely on innate reflexes [and] differs from all other languages in that you can always trust your natural tendency to generalize patterns. [...] The same [[neuropsychology|neuropsychological]] law [— called by] [[Jean Piaget]] ''generalizing assimilation'' — applies to word formation as well as to grammar."
=== Language acquisition ===
Four primary schools in Britain, with some 230 pupils, are currently following a course in "propedeutic Esperanto" under the supervision of the University of Manchester, that is, instruction in Esperanto intended to raise language awareness and accelerate subsequent learning of foreign languages. Several studies demonstrate that studying Esperanto before another foreign language speeds and improves learning the second language to a greater extent than other languages which have been investigated. This appears to be because learning subsequent foreign languages is easier than learning one's first, while the use of a grammatically simple and culturally flexible auxiliary language like Esperanto lessens the first-language learning hurdle. In one study, a group of European [[secondary school]] students studied Esperanto for one year, then French for three years, and ended up with a significantly better command of French than a control group, who studied French for all four years. Similar results were found when the course of study was reduced to two years, of which six months was spent learning Esperanto. Results are not yet available from a study in Australia to see if similar benefits would occur for learning East Asian languages, but the pupils taking Esperanto did better and enjoyed the subject more than those taking other languages.
== Community ==
=== Geography and demography ===
Esperanto speakers are more numerous in Europe and East [[Asia]] than in the Americas, [[Africa]], and [[Oceania]], and more numerous in [[urban area|urban]] than in [[rural]] areas. Esperanto is particularly prevalent in the northern and eastern countries of Europe; in China, [[Korea]], Japan, and [[Iran]] within Asia; in [[Brazil]], [[Argentina]], and [[Mexico]] in the Americas; and in [[Togo]] in Africa.
====Number of speakers====
An estimate of the number of Esperanto speakers was made by the late [[Sidney S. Culbert]], a [[retirement|retired]] [[psychology]] [[professor]] at the [[University of Washington]] and a longtime Esperantist, who tracked down and tested Esperanto speakers in sample areas in dozens of countries over a period of twenty years. Culbert concluded that between one and two million people speak Esperanto at [[ILR or Foreign Service Level language ability measures|Foreign Service Level 3]], "professionally proficient" (able to communicate moderately complex ideas without hesitation, and to follow speeches, radio broadcasts, etc.). Culbert's estimate was not made for Esperanto alone, but formed part of his listing of estimates for all languages of over 1 million speakers, published annually in the [[World Almanac|World Almanac and Book of Facts]]. Culbert's most detailed account of his methodology is found in a 1989 letter to David Wolff. Since Culbert never published detailed intermediate results for particular countries and regions, it is difficult to independently gauge the accuracy of his results.
In the Almanac, his estimates for numbers of language speakers were rounded to the nearest million, thus the number for Esperanto speakers is shown as 2 million. This latter figure appears in ''[[Ethnologue]]''. Assuming that this figure is accurate, that means that about 0.03% of the world's population speaks the language. This falls short of Zamenhof's goal of a [[international auxiliary language|universal language]], but it represents a level of popularity unmatched by any other constructed language.
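The percentage can be checked with a short calculation. The world population figure used here is an assumption made for illustration (roughly 6.6 billion); it is not given in the sources cited above.

```python
speakers = 2_000_000              # rounded Almanac figure quoted above
world_population = 6_600_000_000  # ASSUMPTION: approximate world population

# Share of the world's population, expressed as a percentage.
share_percent = speakers / world_population * 100
print(f"{share_percent:.2f}%")
```

Under that assumption the share comes out at about 0.03%, matching the figure stated above.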
Marcus Sikosek (now [[Ziko van Dijk]]) has challenged Culbert's estimate (1.6 million before rounding) as exaggerated. He estimated that even if Esperanto speakers were evenly distributed, assuming one million Esperanto speakers worldwide would lead one to expect about 180 in the city of [[Cologne, Germany|Cologne]]. Van Dijk found only 30 [[fluency|fluent]] speakers in that city, and similarly smaller than expected figures in several other places thought to have a larger-than-average concentration of Esperanto speakers. He also noted that there are a total of about 20,000 members of the various Esperanto organizations (other estimates are higher). Though there are undoubtedly many Esperanto speakers who are not members of any Esperanto organization, he thought it unlikely that there are fifty times more speakers than organization members.
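The two ratios behind this argument follow directly from the figures quoted in the paragraph above; the variable names in this small sketch are chosen only for illustration.

```python
assumed_speakers = 1_000_000    # worldwide figure assumed in the argument
organization_members = 20_000   # approximate membership of Esperanto organizations
expected_in_cologne = 180       # expected under an even distribution
found_in_cologne = 30           # fluent speakers actually found

# Speakers per organization member implied by the one-million assumption.
speakers_per_member = assumed_speakers / organization_members

# Factor by which the Cologne count falls short of the expectation.
cologne_shortfall = expected_in_cologne / found_in_cologne

print(speakers_per_member, cologne_shortfall)
```

The first ratio is the "fifty times" mentioned in the text; the second shows that the count in Cologne was one sixth of what an even distribution would predict.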
[[Finnish people|Finnish]] [[linguistics|linguist]] Jouko Lindstedt, an expert on native-born Esperanto speakers, presented the following scheme to show the overall proportions of language capabilities within the Esperanto community:
* ''1,000 have Esperanto as their native language''
* ''10,000 speak it fluently''
* ''100,000 can use it actively''
* ''1,000,000 understand a large amount passively''
* ''10,000,000 have studied it to some extent at some time.''
In the absence of Dr. Culbert's detailed sampling data, or any other census data, it is impossible to state the number of speakers with certainty. Few observers, probably, would challenge the following statement from the [[website]] of the [[World Esperanto Association]]:
:Numbers of [[textbook]]s sold and membership of local societies put the number of people with some knowledge of the language in the hundreds of thousands and possibly millions.
====Native speakers====
Ethnologue reports estimates that there are 200 to 2000 native Esperanto speakers ''(denaskuloj),'' who have learned the language from birth from their Esperanto-speaking parents. This usually happens when Esperanto is the chief or only common language in an international family, but sometimes in a family of devoted Esperantists.
The most famous native speaker of Esperanto is businessman [[George Soros]]. Also notable is young Holocaust victim [[Petr Ginz]], whose drawing of the planet Earth as viewed from the moon was carried aboard the Space Shuttle ''[[Space Shuttle Columbia|Columbia]]'' in 2003 ([[STS-107]]).
=== Culture ===
Esperanto speakers can access an international [[culture]], including a large body of original as well as translated [[Esperanto literature|literature]]. There are over 25,000 Esperanto books, both originals and translations, as well as several regularly distributed [[List of Esperanto magazines|Esperanto magazines]]. Esperanto speakers use the language for free accommodations with [[Esperantist]]s in 92 countries using the [[Pasporta Servo]] or to develop [[pen pal]] friendships abroad through the Esperanto Pen Pal Service.
Every year, 1,500 to 3,000 Esperanto speakers meet for the [[World Congress of Esperanto]] ''(Universala Kongreso de Esperanto)''. The [[European Esperanto Union]] ''(Eŭropa Esperanto-Unio)'' brings together the national Esperanto associations of the EU member states and holds congresses every two years. The most recent was in [[Maribor, Slovenia]], in July-August 2007. It attracted 256 delegates from 28 countries, including 2 members of the [[European Parliament]], Ms. [[Małgorzata Handzlik]] of [[Poland]] and Ms. [[Ljudmila Novak]] of [[Slovenia]].
Historically, much [[Esperanto music]] has been in various folk traditions, such as that of ''Kaj Tiel Plu''. In recent decades, more rock and other modern genres have appeared, an example being the Swedish band ''Persone''.
There are also shared [[tradition]]s, such as [[Zamenhof Day]], and shared [[behaviour]] patterns. [[Esperantist]]s speak primarily in Esperanto at [[World Esperanto Congress|international Esperanto meetings]].
Detractors of Esperanto occasionally criticize it as "having no culture". Proponents, such as Prof. [[Humphrey Tonkin]] of the [[University of Hartford]], observe that Esperanto is "culturally neutral by design, as it was intended to be a facilitator between cultures, not to be the carrier of any one national culture." The late [[Scotland|Scottish]] Esperanto author [[William Auld]] wrote extensively on the subject, arguing that Esperanto is "the expression of a [[Esperanto as an international language|common human culture]], unencumbered by national frontiers. Thus it is considered a culture on its own." Others point to Esperanto's potential for strengthening a common European identity, as it combines features of several [[Esperanto etymology|European languages]].
====In popular culture====
Esperanto has been used in a number of films and novels. Typically, this is done either to add the exotic flavour of a foreign language without representing any particular ethnicity, or to avoid going to the trouble of inventing a new language. The [[Charlie Chaplin]] film ''[[The Great Dictator]]'' (1940) showed [[Warsaw ghetto|Jewish ghetto]] shops designated in Esperanto, each with the general Esperanto suffix ''-ejo'' (meaning "place for..."), in order to convey the atmosphere of some 'foreign' [[Eastern Europe|East European]] country without referencing any particular East European language.
Two full-length [[feature film]]s have been produced with [[dialogue]] entirely in Esperanto: ''[[Angoroj]],'' in 1964, and ''[[Incubus (1965 film)|Incubus]],'' a 1965 [[B-movie]] horror film. [[Canada|Canadian]] actor [[William Shatner]] learned Esperanto to a limited level so that he could star in ''Incubus''.
Other amateur productions have been made, such as a dramatisation of the novel ''Gerda Malaperis'' (Gerda Has Disappeared). A number of "mainstream" films in national languages have used Esperanto in some way, such as ''[[Gattaca]]'' (1997), in which Esperanto can be overheard on the public address system. In the 1994 film ''[[Street Fighter]]'', Esperanto is the native language of the fictional country of [[Shadaloo]], and in a barracks scene the soldiers of villain [[M. Bison]] sing a rousing Russian Army-style chorus, the "Bison Troopers Marching Song", in the language. Esperanto is also spoken and appears on signs in the film ''[[Blade: Trinity]]''.
In the British comedy ''[[Red Dwarf]]'', [[Arnold Rimmer]] is seen attempting to learn Esperanto in a number of early episodes, including ''[[Kryten (Red Dwarf episode)|Kryten]]''. In the first season, signs on the titular spacecraft are in both English and Esperanto. Esperanto is used as the universal language in the far future of [[Harry Harrison]]'s ''[[Stainless Steel Rat]]'' and ''[[Deathworld]]'' stories.
In a 1969 guest appearance on ''[[The Tonight Show]]'', [[Jay Silverheels]] of ''[[The Lone Ranger]]'' fame appeared in character as [[Tonto]] for a comedy sketch with [[Johnny Carson]], and claimed Esperanto skills as he sought new employment. The sketch ended with a statement of his ideal situation: "Tonto, to [[Toronto, Canada|Toronto]], for Esperanto, and pronto!"
In the ''[[Danny Phantom]]'' episode "Public Enemies", Danny, Tucker, and Sam come across a ghost wolf who speaks Esperanto, which only Tucker can understand at first.
=== In science ===
In 1921 the [[French Academy of Sciences]] recommended using Esperanto for international scientific communication. A few scientists and mathematicians, such as [[Maurice René Fréchet|Maurice Fréchet]] (mathematics), [[John C. Wells]] (linguistics), [[Helmar Frank]] (pedagogy and cybernetics), and [[Nobel Prize in Economics|Nobel laureate]] [[Reinhard Selten]] (economics) have published part of their work in Esperanto. Frank and Selten were among the founders of the [[Akademio Internacia de la Sciencoj San Marino|International Academy of Sciences]] in [[San Marino]], sometimes called the "Esperanto University", where Esperanto is the primary language of teaching and administration.
=== Goals of the movement ===
Zamenhof's intention was to create an easy-to-learn language to foster international understanding. It was to serve as an international auxiliary language, that is, as a universal second language, not to replace ethnic languages. This goal was widely shared among Esperanto speakers in the early decades of the movement. Later, Esperanto speakers began to see the language and the culture that had grown up around it as ends in themselves, even if Esperanto is never adopted by the United Nations or other international organizations.
Those Esperanto speakers who want to see Esperanto adopted officially or on a large scale worldwide are commonly called ''[[Finvenkismo|finvenkistoj]]'', from ''fina venko'', meaning "final victory", or ''pracelistoj'', from ''pracelo'', meaning "original goal". Those who focus on the intrinsic value of the language are commonly called ''[[Raumism|raŭmistoj]]'', from [[Rauma, Finland|Rauma]], [[Finland]], where a declaration on the near-term unlikelihood of the "fina venko" and the value of Esperanto culture was made at the International Youth Congress in 1980. These categories are, however, not mutually exclusive.
The [[Prague Manifesto (Esperanto)|Prague Manifesto]] (1996) presents the views of the mainstream of the Esperanto movement and of its main organisation, the World Esperanto Association ([[World Esperanto Association|UEA]]).
=== Symbols and flags ===
In 1893, C. Rjabinis and P. Deullin designed and manufactured a lapel pin for Esperantists to identify each other. The design was a circular pin with a white background and a five pointed green star. The theme of the design was the hope of the [[Continent#Number of continents|five continents]] being united by a common language.
The earliest flag, and the one most commonly used today, features a green five-pointed star against a white canton, upon a field of green. It was proposed to Zamenhof by [[Ireland|Irishman]] Richard Geoghegan, author of the first Esperanto textbook for English speakers, in 1887. In 1905, delegates to the first conference of Esperantists at Boulogne-sur-Mer unanimously approved a version that differed from the modern flag only by the superimposition of an "E" over the green star. Other variants include that for Christian Esperantists, with a white [[Christian cross]] superimposed upon the green star, and that for Leftists, with [[Red flag|the color of the field changed from green to red]].
In 1987, a second flag design was chosen in a contest organized by the UEA celebrating the first centennial of the language. It featured a white background with two stylised curved "E"s facing each other. Dubbed the "jubilea simbolo" ([[Esperanto jubilee symbol|jubilee symbol]]), it attracted criticism from some Esperantists, who dubbed it the "melono" (melon) because of the design's elliptical shape. It is still in use, though to a lesser degree than the traditional symbol, known as the "verda stelo" (green star).
=== Religion ===
Esperanto has served an important role in several religions, such as [[Oomoto]] from Japan and [[Baha'i]] from Iran, and has been encouraged by others.
==== Oomoto ====
The [[Oomoto]] religion encourages the use of Esperanto among its followers and includes Zamenhof as one of its deified spirits.
==== Bahá'í Faith ====
The [[Bahá'í Faith]] encourages the [[Bahá'í Faith and auxiliary language|use of an auxiliary international language]]. Although the Faith endorses no specific language, some Bahá'ís see Esperanto as having great potential in this role. [[Lidja Zamenhof]], the daughter of Esperanto founder [[L. L. Zamenhof]], became a Bahá'í.
Various volumes of the [[Bahá'í literature]]s and other Bahá'í books have been translated into Esperanto.
==== Spiritism ====
Esperanto is also actively promoted, at least in [[Brazil]], by followers of [[Spiritism]]. The Brazilian Spiritist Federation publishes Esperanto coursebooks, translations of [[Spiritist Codification|Spiritism's basic books]], and encourages Spiritists to become Esperantists.
==== Bible translations ====
The first translation of the [[Bible]] into Esperanto was a translation of the [[Tanach|Tanakh]] or Old Testament done by [[L. L. Zamenhof]]. The translation was reviewed and compared with other languages' translations by a group of British clergy and scholars before it was published by the [[British and Foreign Bible Society]] in 1910. In 1926 this was published along with a New Testament translation, in an edition commonly called the "Londona Biblio". In the 1960s, the ''Internacia Asocio de Bibliistoj kaj Orientalistoj'' tried to organize a new, ecumenical Esperanto Bible version. Since then, the Dutch Lutheran pastor Gerrit Berveling has translated the [[Deuterocanonical]] or apocryphal books in addition to new translations of the Gospels, some of the New Testament epistles, and some books of the Tanakh or Old Testament. Most of these have been published in various separate booklets or serialized in ''Dia Regno'', while the Deuterocanonical books have also appeared in recent editions of the Londona Biblio.
==== Christianity ====
Two Roman Catholic popes, [[Pope John Paul II|John Paul II]] and [[Pope Benedict XVI|Benedict XVI]], have regularly used Esperanto in their multilingual ''[[urbi et orbi]]'' blessings at Easter and Christmas each year since Easter 1994. Christian Esperanto organizations include two that were formed early in the history of Esperanto, the [[International Union of Catholic Esperantists]] and the [[List of Esperanto organizations#Religion|International Christian Esperantists League]]. An issue of "The Friend" describes the activities of the [[Quaker]] Esperanto Society.
There are instances of Christian apologists and teachers who use Esperanto as a medium. [[Nigeria]]n [[Pastor]] Bayo Afolaranmi's "[http://groups.yahoo.com/group/spiritanutrajxo/ Spirita nutraĵo]" (spiritual food) Yahoo mailing list, for example, has hosted weekly messages since 2003. [[Chick Publications]], publisher of [[Fundamentalist Christianity|Protestant fundamentalist]] themed evangelistic tracts, has published a number of comic book style tracts by [[Jack T. Chick]] translated into Esperanto, including "This Was Your Life!" ("Jen Via Tuto Vivo!").
==== Islam ====
[[Ayatollah Khomeini]] of [[Iran]] called on Muslims to learn Esperanto and praised its use as a medium for better understanding among peoples of different religious backgrounds. After he suggested that Esperanto replace English as an international [[lingua franca]], it began to be used in the seminaries of [[Qom]]. An Esperanto translation of the [[Qur'an]] was published by the state shortly thereafter. In 1981, Khomeini and the Iranian government began to oppose Esperanto after realising that followers of the [[Bahá'í Faith]] were interested in it.
== Criticism ==
Esperanto was conceived as a language of international communication, more precisely as a universal [[second language]]. Since publication, there has been debate over whether it is possible for Esperanto to attain this position, and whether it would be an improvement for international communication if it did. There have been a number of attempts to reform the language, the best known of which is [[Ido]], which resulted in a schism in the community beginning in 1907.
Since Esperanto is a planned language, there have been many, often passionate, criticisms of minor points which are too numerous to cover here, such as Zamenhof's choice of the word ''edzo'' over something like ''spozo'' for "husband, spouse", or his choice of the Classical Greek and Old Latin singular and plural endings ''-o, -oj, -a, -aj'' over their Medieval contractions ''-o, -i, -a, -e.'' (Both these changes were adopted by the Ido reform, though Ido dispensed with adjectival agreement altogether.) See the links [[Esperanto#Criticism|below]] for examples of more general criticism. The more common points include:
* Esperanto has failed the expectations of its founder to become a universal second language. Although many promoters of Esperanto stress the few successes it has had, the fact remains that well over a century since its publication, the portion of the world that speaks Esperanto, and the number of primary and secondary schools which teach it, remain minuscule. It simply cannot compete with English in this regard.
* The vocabulary and grammar are based on major European languages, and are not universal. Often this criticism is specific to a few points such as adjectival agreement and the accusative case (generally such obvious details are all that reform projects suggest changing), but sometimes it is more general: Both the grammar and the 'international' vocabulary are difficult for many Asians, among others, and give an unfair advantage to speakers of European languages.
One attempt to address this issue is [[Lojban]], which draws from the six populous languages [[Arabic language|Arabic]], [[Chinese language|Chinese]], [[English language|English]], [[Hindi]], [[Russian language|Russian]], and [[Spanish language|Spanish]], and whose grammar is designed for computer parsing.
* The vocabulary, diacritic letters, and grammar are too dissimilar from the major Western European languages, and therefore Esperanto is not as easy as it could be for speakers of those languages to learn.
Attempts to address this issue include the younger planned languages [[Ido]] and [[Interlingua]].
* Esperanto phonology is unimaginatively provincial, being essentially [[Belorussian language|Belorussian]] with regularized stress, leaving out only the [[nasal vowel]]s, [[palatalization|palatalized consonants]], and /dz/. For example, Esperanto has phonemes such as {{IPA|/x/, /ʒ/, /ts/, /eu̯/}} ''(ĥ, ĵ, c, eŭ)'' which are rare as distinct phonemes outside Europe. (Note that none of these are found in initial position in English.)
* Esperanto has no culture. Although it has a large international literature, Esperanto does not encapsulate a specific culture.
* Esperanto is culturally European. This is due to the European derivation of its vocabulary, and more insidiously, its [[semantics]]; both infuse the language with a European world view.
* The vocabulary is too large. Rather than deriving new words from existing roots, large numbers of new roots are adopted into the language by people who think they're international, when in fact they're only European. This makes the language much more difficult for non-Europeans than it needs to be.
* Esperanto is [[sexism|sexist]]. As in English, there is no neutral pronoun for ''s/he,'' and most kin terms and titles are masculine by default and only feminine when so specified.
There have been many attempts to address this issue, of which one of the better known is [[Riism]].
* Esperanto is, looks, or sounds artificial. This criticism is primarily due to the letters with circumflex diacritics, which some find odd or cumbersome, and to the lack of fluent speakers: Few Esperantists have spent much time with fluent, let alone native, speakers, and many learn Esperanto relatively late in life, and so speak haltingly, which can create a negative impression among non-speakers. Among fluent speakers, Esperanto sounds no more artificial than any other language. Others claim that an artificial language will necessarily be deficient, due to its very nature, but the [[Hungarian Academy of Sciences]] has found that Esperanto fulfills all the requirements of a living language.
== Modifications ==
Though Esperanto itself has changed little since the publication of the ''[[Fundamento de Esperanto]]'' (Foundation of Esperanto), a number of reform projects have been proposed over the years, starting with [[Reformed Esperanto|Zamenhof's proposals in 1894]] and [[Ido]] in 1907. Several later constructed languages, such as Fasile, were based on Esperanto.
In modern times, attempts have been made to eliminate perceived sexism in the language. One example of this is [[Riism]]. However, as Esperanto has become a living language, changes are as difficult to implement as in ethnic languages.
Formal grammar
In [[formal semantics]], [[computer science]] and [[linguistics]], a '''formal grammar''' (also called '''formation rules''') is a precise description of a [[formal language]] – that is, of a [[set]] of [[String (computer science)|strings]] over some [[Alphabet (computer science)|alphabet]]. In other words, a grammar describes which of the possible sequences of symbols (strings) in a language constitute valid words or statements in that language, but it does not describe their [[semantics]] (i.e. what they mean).
The branch of mathematics that is concerned with the properties of formal grammars and languages is called [[formal language theory]].
A grammar is usually regarded as a means to [[generate]] all the valid strings of a language; it can also be used as the basis for a [[recognizer]] that determines for any given string whether it is [[grammatical]] (i.e. belongs to the language). To describe such recognizers, formal language theory uses separate formalisms, known as [[automata theory|automata]].
A grammar can also be used to [[analyze]] the strings of a language – i.e. to describe their internal structure. In computer science, this process is known as [[parsing]]. Most languages have largely [[compositional semantics]], i.e. the meaning of their utterances is structured according to their [[syntax]]; therefore, the first step in describing the meaning of an utterance is to analyze it and look at its analyzed form (known as its [[parse tree]] in computer science, and as its [[deep structure]] in [[generative grammar]]).
== Background ==
=== Formal language ===
A ''formal language'' is an organized [[set]] of [[symbol]]s the essential feature of which is that it can be precisely defined in terms of just the shapes and locations of those symbols. Such a language can be defined, then, without any [[reference]] to any [[meaning (linguistics)|meaning]]s of any of its expressions; it can exist before any [[formal interpretation]] is assigned to it, that is, before it has any meaning. First order logic is expressed in some formal language. A formal grammar determines which symbols and sets of symbols are [[Formula (mathematical logic)|formula]]s in a formal language.
=== Formal systems ===
A ''formal system'' (also called a ''logical calculus'', or a ''logical system'') consists of a formal language together with a [[deductive apparatus]] (also called a ''deductive system''). The deductive apparatus may consist of a set of [[transformation rule]]s (also called ''inference rules'') or a set of [[axiom]]s, or have both. A formal system is used to [[Proof theory|derive]] one expression from one or more other expressions.
=== Formal proofs ===
A ''formal proof'' is a sequence of well-formed formulas of a formal language, the last one of which is a [[theorem]] of a formal system. The theorem is a [[syntactic consequence]] of all the wffs preceding it in the proof. For a wff to qualify as part of a proof, it must be the result of applying a rule of the deductive apparatus of some formal system to the previous wffs in the proof sequence.
=== Formal interpretations ===
An ''interpretation'' of a formal system is the assignment of meanings to the symbols, and truth-values to the sentences of a formal system. The study of formal interpretations is called [[formal semantics]]. ''Giving an interpretation'' is synonymous with ''constructing a [[Structure (mathematical logic)|model]]''.
== Formal grammars ==
A grammar mainly consists of a set of rules for transforming strings. (If it ''only'' consisted of these rules, it would be a [[semi-Thue system]].) To generate a string in the language, one begins with a string consisting of only a single ''start symbol'', and then successively applies the rules (any number of times, in any order) to rewrite this string. The language consists of all the strings that can be generated in this manner. Any particular sequence of legal choices taken during this rewriting process yields one particular string in the language. If there are multiple ways of generating the same single string, then the grammar is said to be [[ambiguous grammar|ambiguous]].
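For example, assume the alphabet consists of <math>a</math> and <math>b</math>, the start symbol is <math>S</math> and we have the following rules:
: 1. <math>S \rightarrow aSb</math>
: 2. <math>S \rightarrow ba</math>
then we start with <math>S</math>, and can choose a rule to apply to it. If we choose rule 1, we obtain the string <math>aSb</math>. If we choose rule 1 again, we replace <math>S</math> with <math>aSb</math> and obtain the string <math>aaSbb</math>. This process can be repeated at will until all occurrences of ''S'' are removed, and only symbols from the alphabet remain (i.e., <math>a</math> and <math>b</math>). For example, if we now choose rule 2, we replace <math>S</math> with <math>ba</math> and obtain the string <math>aababb</math>, and are done. We can write this series of choices more briefly, using symbols: <math>S \Rightarrow aSb \Rightarrow aaSbb \Rightarrow aababb</math>. The language of the grammar is the set of all the strings that can be generated using this process: <math>\left \{ba, abab, aababb, aaababbb, \ldots\right \}</math>.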
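The rewriting process described above can be sketched in a few lines of Python. This is only an illustration: it assumes the two rules are <math>S \rightarrow aSb</math> and <math>S \rightarrow ba</math>, and the names <code>RULES</code>, <code>apply_rule</code> and <code>derive</code> are hypothetical helpers, not part of any library.

```python
# Assumed rules of the small example grammar: 1. S -> aSb, 2. S -> ba
RULES = {1: ("S", "aSb"), 2: ("S", "ba")}

def apply_rule(string, rule_number):
    """Rewrite the leftmost occurrence of the rule's left-hand side."""
    lhs, rhs = RULES[rule_number]
    if lhs not in string:
        raise ValueError(f"rule {rule_number} does not apply to {string!r}")
    return string.replace(lhs, rhs, 1)

def derive(choices, start="S"):
    """Return the start string followed by every intermediate string."""
    steps = [start]
    for rule_number in choices:
        steps.append(apply_rule(steps[-1], rule_number))
    return steps

# Choosing rule 1, rule 1, then rule 2, as in the text.
print(" => ".join(derive([1, 1, 2])))
```

Each list of rule choices corresponds to one derivation; a derivation is finished once no ''S'' remains in the string.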
=== Formal definition ===
In the classic formalization of generative grammars first proposed by [[Noam Chomsky]] in the 1950s, a grammar ''G'' consists of the following components:
* A finite set <math>N</math> of ''[[nonterminal symbol]]s''.
* A finite set <math>\Sigma</math> of ''[[terminal symbol]]s'' that is [[Disjoint sets|disjoint]] from <math>N</math>.
* A finite set <math>P</math> of ''production rules'', each of the form
:: <math>(\Sigma \cup N)^{*} N (\Sigma \cup N)^{*} \rightarrow (\Sigma \cup N)^{*}</math>
:where <math>{}^{*}</math> is the [[Kleene star]] operator and <math>\cup</math> denotes [[union (set theory)|set union]]. That is, each production rule maps from one string of symbols to another, where the first string contains at least one nonterminal symbol. In the case that the second string is the [[empty string]] (that is, it contains no symbols at all), it is often denoted with a special notation to avoid confusion, commonly <math>\lambda</math>, <math>e</math> or <math>\epsilon</math>.
* A distinguished symbol <math>S \in N</math> that is the ''start symbol''.
A grammar is formally defined as the ordered quad-tuple <math>(N, \Sigma, P, S)</math>. Such a formal grammar is often called a ''rewriting system'' or a ''phrase structure grammar'' in the literature.
The operation of a grammar can be defined in terms of relations on strings:
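* Given a grammar <math>G = (N, \Sigma, P, S)</math>, the binary relation <math>\Rightarrow_G</math> (pronounced as "G derives in one step") on strings in <math>(\Sigma \cup N)^{*}</math> is defined by:
*: <math>x \Rightarrow_G y \mbox{ iff } \exists u, v, p, q \in (\Sigma \cup N)^{*}: (x = upv) \wedge (p \rightarrow q \in P) \wedge (y = uqv)</math>
* the relation <math>\Rightarrow_G^{*}</math> (pronounced as ''G derives in zero or more steps'') is defined as the reflexive [[transitive closure]] of <math>\Rightarrow_G</math>
* the ''language'' of <math>G</math>, denoted as <math>\boldsymbol{L}(G)</math>, is defined as all those strings over <math>\Sigma</math> that can be generated by starting with the start symbol <math>S</math> and then applying the production rules in <math>P</math> until no more nonterminal symbols are present; that is, the set <math>\{ w \in \Sigma^{*} \mid S \Rightarrow_G^{*} w \}</math>.
Note that the grammar <math>G = (N, \Sigma, P, S)</math> is effectively the [[semi-Thue system]] <math>(N \cup \Sigma, P)</math>, rewriting strings in exactly the same way; the only difference is that we distinguish specific ''nonterminal'' symbols which must be rewritten in rewrite rules, and are only interested in rewritings from the designated start symbol <math>S</math> to strings without nonterminal symbols.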
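The "derives in one step" relation and the language it induces can be illustrated with a small Python sketch. It makes several assumptions that are not part of the definition: every symbol is a single character, productions are given as pairs of strings, and no production shortens a string, so that a length bound can be used to stop the search. The function names and the example grammar (<math>S \rightarrow aSb</math>, <math>S \rightarrow ab</math>) are hypothetical.

```python
from collections import deque

def one_step(x, productions):
    """All y with x => y: write x = u p v, pick p -> q in P, return u q v."""
    results = set()
    for p, q in productions:
        start = x.find(p)
        while start != -1:
            results.add(x[:start] + q + x[start + len(p):])
            start = x.find(p, start + 1)
    return results

def language_up_to(productions, start_symbol, terminals, max_len):
    """Terminal strings of length <= max_len derivable from the start symbol.

    Assumes no production shortens a string, so sentential forms longer
    than max_len can be discarded without losing any word.
    """
    seen = {start_symbol}
    queue = deque([start_symbol])
    words = set()
    while queue:
        x = queue.popleft()
        if all(ch in terminals for ch in x):
            words.add(x)  # no nonterminal left: x belongs to the language
            continue
        for y in one_step(x, productions):
            if len(y) <= max_len and y not in seen:
                seen.add(y)
                queue.append(y)
    return words

# Hypothetical example grammar: N = {S}, Sigma = {a, b}, P = {S -> aSb, S -> ab}
P = [("S", "aSb"), ("S", "ab")]
print(sorted(language_up_to(P, "S", {"a", "b"}, 6), key=len))
```

The breadth-first search follows the reflexive transitive closure of the one-step relation; the length bound is only a device for keeping the enumeration finite.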
=== Example ===
''For these examples, formal languages are specified using [[set-builder notation]].''
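Consider the grammar <math>G</math> where <math>N = \left \{S, B\right \}</math>, <math>\Sigma = \left \{a, b, c\right \}</math>, <math>S</math> is the start symbol, and <math>P</math> consists of the following production rules:
: 1. <math>S \rightarrow aBSc</math>
: 2. <math>S \rightarrow abc</math>
: 3. <math>Ba \rightarrow aB</math>
: 4. <math>Bb \rightarrow bb</math>
Some examples of the derivation of strings in <math>\boldsymbol{L}(G)</math> are:
* <math>\boldsymbol{S} \Rightarrow_2 \boldsymbol{abc}</math>
* <math>\boldsymbol{S} \Rightarrow_1 \boldsymbol{aBSc} \Rightarrow_2 aB\boldsymbol{abc}c \Rightarrow_3 a\boldsymbol{aB}bcc \Rightarrow_4 aa\boldsymbol{bb}cc</math>
* <math>\boldsymbol{S} \Rightarrow_1 \boldsymbol{aBSc} \Rightarrow_1 aB\boldsymbol{aBSc}c \Rightarrow_2 aBaB\boldsymbol{abc}cc \Rightarrow_3 a\boldsymbol{aB}Babccc \Rightarrow_3 aaB\boldsymbol{aB}bccc \Rightarrow_3 aa\boldsymbol{aB}Bbccc \Rightarrow_4 aaaB\boldsymbol{bb}ccc \Rightarrow_4 aaa\boldsymbol{bb}bccc</math>
:(Note on notation: <math>L \Rightarrow_i R</math> reads "''L'' generates ''R'' by means of production ''i''" and the generated part is each time indicated in bold.)
This grammar defines the language <math>L = \left \{ a^{n}b^{n}c^{n} \mid n \ge 1 \right \}</math> where <math>a^{n}</math> denotes a string of ''n'' consecutive <math>a</math>'s. Thus, the language is the set of strings that consist of 1 or more <math>a</math>'s, followed by the same number of <math>b</math>'s, followed by the same number of <math>c</math>'s.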
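Derivations of this kind can be replayed mechanically. The following Python sketch assumes the four production rules are <math>S \rightarrow aBSc</math>, <math>S \rightarrow abc</math>, <math>Ba \rightarrow aB</math> and <math>Bb \rightarrow bb</math>, and always rewrites the leftmost match of a rule's left-hand side; the helper names are hypothetical.

```python
# Assumed production rules of the example grammar, numbered as in the text:
# 1. S -> aBSc   2. S -> abc   3. Ba -> aB   4. Bb -> bb
RULES = {1: ("S", "aBSc"), 2: ("S", "abc"), 3: ("Ba", "aB"), 4: ("Bb", "bb")}

def replay(rule_numbers, start="S"):
    """Apply each numbered rule to the leftmost match of its left-hand side."""
    current = start
    for i in rule_numbers:
        lhs, rhs = RULES[i]
        if lhs not in current:
            raise ValueError(f"rule {i} does not apply to {current!r}")
        current = current.replace(lhs, rhs, 1)
    return current

def in_target_language(w):
    """True if w has the form a^n b^n c^n for some n >= 1."""
    n = len(w) // 3
    return n >= 1 and w == "a" * n + "b" * n + "c" * n

# Three sequences of rule choices, each ending in a terminal string.
for choices in ([2], [1, 2, 3, 4], [1, 1, 2, 3, 3, 3, 4, 4]):
    word = replay(choices)
    print(choices, word, in_target_language(word))
```

Rules 3 and 4 have more than one symbol on the left-hand side, which is why the rewriting operates on substrings rather than on single nonterminals.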
=== The Chomsky hierarchy ===
When [[Noam Chomsky]] first formalized generative grammars in 1956, he classified them into types now known as the [[Chomsky hierarchy]]. The difference between these types is that they have increasingly strict production rules and can express fewer formal languages. Two important types are ''[[context-free grammar]]s'' (Type 2) and ''[[regular grammar]]s'' (Type 3). The languages that can be described with such a grammar are called ''[[context-free language]]s'' and ''[[regular language]]s'', respectively. Although much less powerful than unrestricted grammars (Type 0), which can in fact express any language that can be accepted by a [[Turing machine]], these two restricted types of grammars are most often used because [[parsing|parser]]s for them can be efficiently implemented. For example, all regular languages can be recognized by a [[finite state machine]], and for useful subsets of context-free grammars there are well-known algorithms to generate efficient [[LL parser]]s and [[LR parser]]s to recognize the corresponding languages those grammars generate.
==== Context-free grammars ====
A ''[[context-free grammar]]'' is a grammar in which the left-hand side of each production rule consists of only a single nonterminal symbol. This restriction is non-trivial; not all languages can be generated by context-free grammars. Those that can are called ''context-free languages''.
The language defined above is not a context-free language, and this can be strictly proven using the [[pumping lemma for context-free languages]], but for example the language <math>\left \{ a^{n}b^{n} \mid n \ge 1 \right \}</math> (at least 1 <math>a</math> followed by the same number of <math>b</math>'s) is context-free, as it can be defined by the grammar <math>G_2</math> with <math>N=\left \{S\right \}</math>, <math>\Sigma=\left \{a,b\right \}</math>, <math>S</math> the start symbol, and the following production rules:
: 1. <math>S \rightarrow aSb</math>
: 2. <math>S \rightarrow ab</math>
A context-free language can be recognized in <math>O(n^3)</math> time (''see'' [[Big O notation]]) by an algorithm such as [[Earley's algorithm]]. That is, for every context-free language, a machine can be built that takes a string as input and determines in <math>O(n^3)</math> time whether the string is a member of the language, where <math>n</math> is the length of the string. Further, some important subsets of the context-free languages can be recognized in linear time using other algorithms.
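As an illustration of cubic-time recognition, the sketch below uses the CYK algorithm rather than Earley's algorithm, because CYK is shorter to write down. It assumes the language <math>\{ a^{n}b^{n} \mid n \ge 1 \}</math> and a hand-made grammar in Chomsky normal form for it (<math>S \rightarrow AB \mid AT</math>, <math>T \rightarrow SB</math>, <math>A \rightarrow a</math>, <math>B \rightarrow b</math>); that grammar and all names in the code are assumptions made for the example.

```python
# Hypothetical Chomsky-normal-form grammar for { a^n b^n | n >= 1 }:
#   S -> A B | A T,   T -> S B,   A -> a,   B -> b
UNARY = {"a": {"A"}, "b": {"B"}}
BINARY = {("A", "B"): {"S"}, ("A", "T"): {"S"}, ("S", "B"): {"T"}}

def cyk_recognize(word, start="S"):
    """CYK recognition; the nested loops over spans give O(n^3) time."""
    n = len(word)
    if n == 0:
        return False  # this grammar does not derive the empty string
    # table[length - 1][i] holds the nonterminals deriving word[i:i + length]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(word):
        table[0][i] = set(UNARY.get(ch, ()))
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            for split in range(1, length):
                for left in table[split - 1][i]:
                    for right in table[length - split - 1][i + split]:
                        table[length - 1][i] |= BINARY.get((left, right), set())
    return start in table[n - 1][0]

print(cyk_recognize("aaabbb"), cyk_recognize("aab"))
```

CYK requires the grammar to be in Chomsky normal form, whereas Earley's algorithm accepts arbitrary context-free grammars; both share the cubic worst-case bound in the length of the input.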
==== Regular grammars ====
In [[regular grammar]]s, the left hand side is again only a single nonterminal symbol, but now the right-hand side is also restricted: It may be the empty string, or a single terminal symbol, or a single terminal symbol followed by a nonterminal symbol, but nothing else. (Sometimes a broader definition is used: one can allow longer strings of terminals or single nonterminals without anything else, making languages [[syntactic sugar|easier to denote]] while still defining the same class of languages.)
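The language defined above is not regular, but the language <math>\left \{ a^{n}b^{m} \mid m,n \ge 1 \right \}</math> (at least 1 <math>a</math> followed by at least 1 <math>b</math>, where the numbers may be different) is, as it can be defined by the grammar <math>G_3</math> with <math>N=\left \{S, A, B\right \}</math>, <math>\Sigma=\left \{a,b\right \}</math>, <math>S</math> the start symbol, and the following production rules:
:# <math>S \rightarrow aA</math>
:# <math>A \rightarrow aA</math>
:# <math>A \rightarrow bB</math>
:# <math>B \rightarrow bB</math>
:# <math>B \rightarrow \epsilon</math>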
All languages generated by a regular grammar can be recognized in linear time by a [[finite state machine]]. Although, in practice, regular grammars are commonly expressed using [[regular expression]]s, some forms of regular expression used in practice do not strictly generate the regular languages and do not show linear recognitional performance due to those deviations.
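A regular grammar translates almost directly into a finite state machine. The sketch below assumes the language <math>\{ a^{n}b^{m} \mid m,n \ge 1 \}</math> with the rules <math>S \rightarrow aA</math>, <math>A \rightarrow aA</math>, <math>A \rightarrow bB</math>, <math>B \rightarrow bB</math> and <math>B \rightarrow \epsilon</math>; each nonterminal becomes a state, each rule with a terminal becomes a transition, and a nonterminal that can be rewritten to the empty string becomes an accepting state. The names in the code are hypothetical.

```python
# States are named after the assumed nonterminals of the regular grammar:
# S -> aA, A -> aA, A -> bB, B -> bB, B -> (empty string)
TRANSITIONS = {
    ("S", "a"): "A",
    ("A", "a"): "A",
    ("A", "b"): "B",
    ("B", "b"): "B",
}
ACCEPTING = {"B"}  # B may be rewritten to the empty string

def accepts(word, start="S"):
    """Run the machine over the word, one table lookup per input symbol."""
    state = start
    for ch in word:
        state = TRANSITIONS.get((state, ch))
        if state is None:
            return False  # no rule allows this symbol in this state
    return state in ACCEPTING

print(accepts("aaab"), accepts("aba"))
```

The loop performs a constant amount of work per input symbol, which is the linear-time behaviour described above.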
=== Other forms of generative grammars ===
Many extensions and variations on Chomsky's original hierarchy of formal grammars have been developed more recently, both by linguists and by computer scientists, usually either in order to increase their expressive power or in order to make them easier to analyze or [[parsing|parse]]. Some forms of grammars developed include:
* [[Tree-adjoining grammar]]s increase the expressiveness of conventional generative grammars by allowing rewrite rules to operate on [[parse tree]]s instead of just strings.
* [[Affix grammar]]s and [[attribute grammar]]s allow rewrite rules to be augmented with semantic attributes and operations, useful both for increasing grammar expressiveness and for constructing practical language translation tools.
== Analytic grammars ==
Though there is a large body of literature on [[parsing]] [[algorithms]], most of these algorithms assume that the language to be parsed is initially ''described'' by means of a ''generative'' formal grammar, and that the goal is to transform this generative grammar into a working parser. Strictly speaking, a generative grammar does not in any way correspond to the algorithm used to parse a language, and various algorithms have different restrictions on the form of production rules that are considered well-formed.
An alternative approach is to formalize the language in terms of an analytic grammar in the first place, which more directly corresponds to the structure and semantics of a parser for the language. Examples of analytic grammar formalisms include the following:
* [[The Language Machine]] directly implements unrestricted analytic grammars. Substitution rules are used to transform an input to produce outputs and behaviour. The system can also produce [http://languagemachine.sourceforge.net/picturebook.html the lm-diagram] which shows what happens when the rules of an unrestricted analytic grammar are being applied.
* [[Top-down parsing language]] (TDPL): a highly minimalist analytic grammar formalism developed in the early 1970s to study the behavior of [[Top-down parsing|top-down parsers]].
* [[Link grammar]]s: a form of analytic grammar designed for [[linguistics]], which derives syntactic structure by examining the positional relationships between pairs of words.
* [[Parsing expression grammar]]s (PEGs): a more recent generalization of TDPL designed around the practical [[expressiveness]] needs of [[programming language]] and [[compiler]] writers.
Free software
'''Free software''' or software libre is [[software]] that can be used, studied, and modified without restriction, and which can be copied and redistributed in modified or unmodified form either without restriction, or with minimal restrictions only to ensure that further recipients can also do these things. In practice, for software to be distributed as free software, the human-readable form of the program (the "[[source code]]") must be made available to the recipient along with a notice granting the above permissions. Such a notice is either a "[[free software licence]]" or, in theory, a notice saying that the source code is released into the [[public domain]].
The [[free software movement]] was conceived in 1983 by [[Richard Stallman]] to make these freedoms available to every computer user. From the late 1990s onward, [[alternative terms for free software]] came into use. "'''[[Open source software]]'''" is the most common such alternative term. Others include "'''software [[Gratis versus Libre|libre]]'''", "free, libre and open-source software" ("'''[[FOSS]]'''", or, with "libre", "'''FLOSS'''"). The antonym of free software is "''[[proprietary software]]''" or ''non-free software''.
Free software is distinct from "[[freeware]]" which is [[proprietary software]] made available free of charge. Users usually cannot study, modify, or redistribute freeware.
Since free software may be freely redistributed, it generally is available at little or no cost. Free software business models are usually based on adding value such as support, training, customization, integration, or certification. At the same time, some business models which work with [[proprietary software]] are not compatible with free software, such as those that depend on a user paying for a licence in order to lawfully use a software product.
== History ==
In the 1950s, 1960s, and 1970s, it was normal for computer users to have the freedoms that are provided by free software. [[Software]] was commonly shared by individuals who used computers and by hardware manufacturers who were glad that people were making software that made their hardware useful. In the 1970s and early 1980s, the [[software industry]] began using technical measures (such as only distributing [[Executable|binary copies]] of [[computer programs]]) to prevent [[computer users]] from being able to study and modify software. In 1980 [[copyright]] law was extended to computer programs.
In 1983, [[Richard Stallman]], longtime member of the [[hacker (free and open source software)|hacker]] community at the [[MIT Artificial Intelligence Laboratory]], announced the [[GNU project]], saying that he had become frustrated with the effects of the change in culture of the computer industry and its users. Software development for the [[GNU operating system]] began in January 1984, and the [[Free Software Foundation]] (FSF) was founded in October 1985. Stallman developed a free software definition and the concept of "[[copyleft]]", designed to ensure software freedom for all.
Free software is a widespread international concept, producing software used by individuals, large organizations, and governmental administrations. Free software has a very high market penetration in server-side Internet applications such as the [[Apache web server]], [[MySQL]] database, and [[PHP]] scripting language. Completely free computing environments are available as large packages of basic system software, such as the many [[GNU/Linux distribution]]s and [[FreeBSD]]. Free software [[Software development|developers]] have also created free versions of almost all commonly used desktop applications, including Web browsers, office productivity suites, and multimedia players. It is important to note, however, that in many categories, free software for individual [[workstation]]s or home users has only a fraction of the market share of its proprietary competitors. Most free software is distributed [[online]] without charge, or [[off-line]] at the [[marginal cost]] of distribution, but this pricing model is not required, and people may sell copies of free software programs for any price.
The economic viability of free software has been recognised by large corporations such as [[IBM]], [[Red Hat]], and [[Sun Microsystems]]. Many companies whose core business is not in the IT sector choose free software for their Internet information and sales sites, due to the lower initial capital investment and ability to freely customize the application packages. Also, some non-software industries are beginning to use techniques similar to those used in free software development for their research and development process; scientists, for example, are looking towards more open development processes, and hardware such as microchips are beginning to be developed with specifications released under [[copyleft]] licenses (see the [[OpenCores]] project, for instance). [[Creative Commons]] and the [[free culture movement]] have also been largely influenced by the free software movement.
===Naming===
The FSF recommends using the term "free software" rather than "open source software" because the latter term and its associated marketing campaign focus on the technical issues of software development, avoiding the issue of user freedoms. "[[Libre]]" is used to avoid the ambiguity of the word "free". However, amongst English speakers, ''libre'' is used primarily within the free software movement.
== Definition ==
The first formal definition of free software was published by FSF in February 1986. That definition, written by Richard Stallman, is still maintained today and states that software is free software if people who receive a copy of the software have the following four freedoms:
* Freedom 0: The freedom to run the program for any purpose.
* Freedom 1: The freedom to study and modify the program.
* Freedom 2: The freedom to copy the program so you can help your neighbor.
* Freedom 3: The freedom to improve the program, and release your improvements to the public, so that the whole community benefits.
Freedoms 1 and 3 require [[source code]] to be available because studying and modifying software without its source code is highly impractical.
Thus, free software means that [[user (computing)|computer users]] have the freedom to cooperate with whom they choose, and to control the software they use. To summarize this into a remark distinguishing ''[[Gratis versus Libre|libre]]'' (freedom) software from ''[[Gratis versus Libre|gratis]]'' (zero price) software, [[Richard Stallman]] said: "''Free software is a matter of liberty, not price. To understand the concept, you should think of 'free' as in '[[free speech]]', not as in '[[free beer]]'''".
In the late 1990s, other groups published their own definitions which describe an almost identical set of software. The most notable are the [[Debian Free Software Guidelines]], published in 1997, and the [[Open Source Definition]], published in 1998.
The BSD-based operating systems, such as [[FreeBSD]], [[OpenBSD]], and [[NetBSD]], do not have their own formal definitions of free software. Users of these systems generally find the same set of software to be acceptable, but sometimes see copyleft as restrictive. They generally advocate [[permissive free software licenses]], which allow others to make software based on their source code, and then release the modified result as proprietary software. Their view is that this permissive approach is more free. The [[Kerberos (protocol)|Kerberos]], [[X.org]], and [[Apache License|Apache]] software licenses are substantially similar in intent and implementation. All of these software packages originated in academic institutions interested in wide technology transfer ([[University of California]], [[Massachusetts Institute of Technology|MIT]], and [[University of Illinois at Urbana-Champaign|UIUC]]).
== Examples of free software ==
The [[Free Software Directory]] is a free software project that maintains a large database of free software packages.
===Notable free software===
* [[Graphical user interface|GUI]] related
**[[X Window System]]
**[[GNOME]]
**[[KDE]]
**[[Xfce]] desktop environments
* [[OpenOffice.org]] office suite
* [[Mozilla Application Suite|Mozilla]] and [[Mozilla Firefox|Firefox]] web browsers.
* Typesetting and document preparation systems
**[[TeX]]
**[[LaTeX]]
* Graphics tools like [[GIMP]] image graphics editor and [[Blender (software)|Blender]] 3D animation program.
* [[Text editor]]s like [[vi]] or [[emacs]].
* [[ogg]] is a free software multimedia container, used to hold [[ogg vorbis]] sound and [[ogg theora]] video.
* [[Relational database]] systems
**[[MySQL]]
**[[PostgreSQL]]
* [[GNU Compiler Collection|GCC]] compilers, [[GDB]] debugger and the [[GNU C Library]].
====Programming languages====
*[[Java (programming language)|Java]]
*[[Perl]]
*[[PHP]]
*[[Python (programming language)|Python]]
*[[Lua (programming language)|Lua]]
*[[Ruby programming language|Ruby]]
*[[Tcl]]
====Servers====
*[[Apache HTTP Server|Apache web server]]
*[[BIND]] name server
*[[Sendmail]] mail transport
*[[Samba software|Samba]] file server.
====Operating systems====
*[[GNU/Linux]]
*[[Berkeley Software Distribution|BSD]]
*[[Darwin (operating system)|Darwin]]
*[[OpenSolaris]]
== Free software licenses ==
All free software licenses must grant people all the freedoms discussed above. However, unless the applications' licenses are compatible, combining programs by mixing source code or directly linking binaries is problematic, because of license technicalities. Programs indirectly connected together may avoid this problem.
The majority of free software uses a small set of licenses. The most popular of these licenses are:
* the [[GNU General Public License]]
* the [[GNU Lesser General Public License]]
* the [[BSD License]]
* the [[Mozilla Public License]]
* the [[MIT License]]
* the [[Apache License]]
The Free Software Foundation and the Open Source Initiative both publish lists of licenses that they find to comply with their own definitions of free software and open-source software respectively.
* [[List of FSF approved software licenses]]
* [[List of OSI approved software licenses]]
These lists are necessarily incomplete, because a license need not be known by either organization in order to provide these freedoms.
Apart from these two organizations, the [[Debian]] project is seen by some to provide useful advice on whether particular licenses comply with their [[Debian Free Software Guidelines]]. Debian doesn't publish a list of ''approved'' licenses, so its judgments have to be tracked by checking what software they have allowed into their software archives. That is summarized at the Debian web site.
However, it is rare for a license to be found compliant with the guidelines of one of the FSF or OSI and not the other (the [[Netscape Public License]] used for early versions of Mozilla being an exception), so the exact definitions of the terms have not become hot issues.
=== Permissive and copyleft licenses ===
The FSF categorizes licenses in the following ways:
* [[Public domain]] software - the copyright has expired, the work was not copyrighted or the author has abandoned the copyright. Since public-domain software lacks copyright protection, it may be freely incorporated into any work, whether proprietary or free.
* [[permissive free software licences|Permissive licenses]], also called BSD-style because they are applied to much of the software distributed with the [[Berkeley Software Distribution|BSD]] operating systems. The author retains copyright solely to disclaim warranty and require proper attribution of modified works, but permits redistribution and modification in ''any'' work, even proprietary ones.
* [[Copyleft]] licenses, the [[GNU General Public License]] being the most prominent. The author retains copyright and permits redistribution and modification provided all such redistribution is licensed under the same license. Additions and modifications by others must also be licensed under the same 'copyleft' license whenever they are distributed with part of the original licensed product.
== Security and reliability==
There is debate over the [[computer security|security]] of free software in comparison to proprietary software, with a major issue being [[security through obscurity]]. A popular quantitative test in computer security is using relative counting of known unpatched security flaws. Generally, users of this method advise avoiding products which lack fixes for known security flaws, at least until a fix is available. Some claim that this method is biased by counting more vulnerabilities for the free software, since its source code is accessible and its community is more forthcoming about what problems exist.
Free software advocates counter that even if proprietary software does not have "published" flaws, flaws could still exist and possibly be known to malicious users. The ability of users to view and modify the source code allows many more people to analyse the code, and possibly to find bugs and flaws at a higher rate than an average-sized corporation could manage. Users having access to the source code also makes creating and deploying [[spyware]] far more difficult. [[David A. Wheeler]] has published research concluding that free software is quantitatively more reliable than proprietary software.
== Adoption ==
Free software played a part in the development of the Internet, the World Wide Web and the infrastructure of [[dot-com companies]].
Free software allows users to cooperate in enhancing and refining the programs they use; free software is a [[pure public good]] rather than a [[private good]]. Companies that contribute to free software can increase commercial [[innovation]] amidst the void of [[patent]] [[cross licensing]] lawsuits. (See [[Mpeg2#Patent holders|mpeg2 patent holders]])
Under the free software business model, free software vendors may charge a fee for distribution and offer pay support and software customization services. Proprietary software uses a different business model, where a customer of the proprietary software pays a fee for a license to use the software. This license may grant the customer the ability to configure some or no parts of the software themselves. Often some level of support is included in the purchase of proprietary software, but additional support services (especially for enterprise applications) are usually available for an additional fee. Some proprietary software vendors will also customize software for a fee.
Free software is generally available at little to no cost and can result in permanently lower costs compared to [[proprietary software]]. With free software, businesses can fit software to their specific needs by changing the software themselves or by hiring programmers to modify it for them. Free software often has no warranty, and more importantly, generally does not assign legal liability to anyone. However, warranties are permitted between any two parties upon the condition of the software and its usage. Such an agreement is made separately from the free software license.
== Controversies ==
=== Binary blobs ===
In 2006, [[OpenBSD]] started the first campaign against the use of [[binary blobs]] in [[kernel (computer science)|kernels]]. Blobs are usually freely distributable [[device driver]]s for hardware from vendors that do not reveal driver source code to users or developers. This restricts the users' freedom to effectively modify the software and distribute modified versions. Also, since the blobs are undocumented and may have [[computer bug|bugs]], they pose a security risk to any [[operating system]] whose kernel includes them. The proclaimed aim of the campaign against blobs is to collect hardware documentation that allows developers to write free software drivers for that hardware, ultimately enabling all free operating systems to become or remain blob-free.
The issue of binary blobs in the [[Linux kernel]] and other device drivers motivated some developers in Ireland to launch [[gNewSense]], a GNU/Linux distribution with all the binary blobs removed. The project received support from the [[Free Software Foundation]].
=== BitKeeper ===
[[Larry McVoy]] invited high-profile free software projects to use his proprietary [[versioning system]], [[BitKeeper]], free of charge, in order to attract paying users. In 2002, Linux coordinator [[Linus Torvalds]] decided to use BitKeeper to develop the Linux kernel, a free software project, claiming no free software alternative met his needs. This controversial decision drew criticism from several sources, including the Free Software Foundation's founder Richard Stallman.
Following the apparent [[reverse engineering]] of BitKeeper's protocols, McVoy withdrew permission for gratis use by free software projects, leading the Linux kernel community to develop a free software replacement in [[Git (software)|Git]].
=== Patent deals ===
In November 2006, the [[Microsoft]] and [[Novell]] software corporations announced a controversial partnership involving, among other things, patent protection for some customers of Novell under certain conditions.
Freeware
'''Freeware''' is computer [[software]] that is available for use at no cost or for an optional fee. Freeware is often made available in a binary-only, [[proprietary software|proprietary]] form, thus making it distinct from [[free software]]. Proprietary freeware allows authors to contribute something for the benefit of the community, while at the same time allowing them to retain control of the source code and preserve its business potential. Freeware is different from [[shareware]], where the user is obliged to pay (e.g. after some trial period or for additional functionality).
== History ==
The term ''freeware'' was coined by [[Andrew Fluegelman]] when he wanted to sell a communications program named [[PC-Talk]] that he had created but for which he did not wish to use traditional methods of distribution because of their cost. Fluegelman actually distributed PC-Talk via a process now referred to as [[shareware]]. Current use of the term freeware does not necessarily match the original concept by Andrew Fluegelman.
== Criteria ==
The only criterion for being classified as freeware is that the software must be fully functional for an unlimited time with no monetary cost. The software license may impose one or more other restrictions on the type of use including personal use, individual use, non-profit use, non-commercial use, academic use, commercial use or any combination of these. For instance, the license may be "free for personal, non-commercial use." Everything created with freeware programs (for example graphics, documents, or sounds made by the user) can be distributed at no cost.
French language
'''French''' (''français'') is today spoken around the world by 72 to 130 million people as a [[first language|native]] language, and by about 190 to 600 million people as a [[second language|second]] or third language, with significant numbers of speakers in 54 countries. Most native speakers of the language live in [[France]], where the language originated. The rest live in [[Canada]], [[Belgium]] and [[Switzerland]].
French is a descendant of the [[Latin]] language of the [[Roman Empire]], as are languages such as [[Portuguese language|Portuguese]], [[Spanish language|Spanish]], [[Italian language|Italian]], [[Catalan language|Catalan]] and [[Romanian language|Romanian]]. Its development was also influenced by the native [[Celtic languages]] of Roman [[Gaul]] and by the [[Germanic languages|Germanic]] language of the post-Roman [[Frankish]] invaders.
It is an [[official language]] in [[List of countries where French is an official language|29 countries]], most of which form what is called in French ''La [[Francophonie]]'', the community of French-speaking nations. It is an official language of all [[United Nations]] agencies and a [[List of international organisations which have French as an official language|large number of international organizations]]. According to the [[European Union]], 129 million (26% of the 497,198,740) people in 27 member states speak French, of which 59 million (12%) speak it natively and 69 million (14%) claim to speak it as a second language, which makes it the third most spoken second language in the Union, after English and German respectively.
== Geographic distribution==
===Europe===
====Legal status in France====
Per the [[Constitution of France]], French has been the official language since 1992 (although previous legal texts have made it official since 1539, see [[ordinance of Villers-Cotterêts]]). [[France]] mandates the use of French in official government publications, public [[education]] outside of specific cases (though these dispositions are often ignored) and legal [[contract]]s; [[advertisement]]s must bear a translation of foreign words.
In addition to French, there are also a variety of regional languages. France has signed the European Charter for Regional Languages but has not ratified it since that would go against the 1958 Constitution.
====Switzerland====
French is one of the four official languages of [[Switzerland]] (along with [[German language|German]], [[Italian language|Italian]], and [[Romansh language|Romansh]]) and is spoken in the part of Switzerland called ''[[Romandie]]''. French is the native language of about 20% of the Swiss population.
====Belgium====
In [[Belgium]], French is the official language of [[Wallonia]] (excluding the [[East Cantons]], which are [[German language|German-speaking]]) and one of the two official languages, along with [[Dutch language|Dutch]], of the [[Brussels-Capital Region]], where it is spoken by the majority of the population, though often not as their primary language. French and German are neither official languages nor recognised minority languages in the [[Flemish Region]], although along the borders with the Walloon and Brussels-Capital regions there are a dozen [[municipalities with language facilities]] for French-speakers; a mirroring situation exists for the Walloon Region with respect to the Dutch and German languages. In total, native French-speakers make up about 40% of the country's population. The remaining 60% speak Dutch, and 59% of these Dutch-speakers claim to speak French as a second language. French is thus known by an estimated 75% of all Belgians, either as a mother tongue or as a second or third language.
====Monaco and Andorra====
Although [[Monégasque language|Monégasque]] is the national language of the [[Principality of Monaco]], French is the only official language, and French nationals make up some 47% of the population. [[Catalan language|Catalan]] is the only official language of [[Andorra]]; however, French is commonly used due to the proximity to France. French nationals make up 7% of the population.
====Italy====
French is also an official language, along with [[Italian language|Italian]], in the province of [[Aosta Valley]], [[Italy]]. In addition, a number of [[Franco-Provençal language|Franco-Provençal]] dialects are spoken in the province, although they do not have official recognition.
====Luxembourg====
French is one of the three official languages of [[Luxembourg|the Grand Duchy of Luxembourg]]; the other two are [[German language|German]] and [[Lëtzebuergesch|Luxembourgish]]. Luxembourgish is the natively spoken language of Luxembourg. Luxembourg's education system is trilingual: the first years of primary school are taught in Luxembourgish, instruction then switches to German, and in secondary school the language of instruction changes to French.
====The Channel Islands====
Although [[Jersey]] and [[Guernsey]], the two bailiwicks collectively referred to as the [[Channel Islands]], are separate entities, both use French to some degree, mostly in an administrative capacity. [[Jersey Legal French]] is the standardized variety used in Jersey.
===The Americas===
====Legal status in Canada====
About 7 million [[Canadian]]s are native French-speakers, of whom 6 million live in [[Quebec]], and French is one of [[Canada]]'s two official languages (the other being [[English language|English]]). Various provisions of the [[Canadian Charter of Rights and Freedoms]] deal with Canadians' right to access services in both languages, including the right to a publicly funded education in the minority language of each province, where numbers warrant in a given locality. By [[law]], the federal government must operate and provide services in both English and French, proceedings of the [[Parliament of Canada]] must be translated into both these languages, and most products sold in Canada must have labeling in both languages.
Overall, about 13% of Canadians have knowledge of French only, while 18% have knowledge of both English and French. In contrast, over 82% of the population of Quebec speaks French natively, and almost 96% speak it as either their first or second language. It has been the sole official language of Quebec since 1974. The legal status of French was further strengthened with the 1977 adoption of the [[Charter of the French Language]] (popularly known as ''Bill 101''), which guarantees that every person has a right to have the civil administration, the health and social services, corporations, and enterprises in Quebec communicate with him in French. While the Charter mandates that certain provincial government services, such as those relating to health and education, be offered to the English minority in its language, where numbers warrant, its primary purpose is to cement the role of French as the primary language used in the public sphere.
[[Image:Knowledge French EU map.png|right|thumb|240px|Knowledge of French in the European Union and candidate countries]]
The provision of the Charter that has arguably had the most significant impact mandates French-language [[education]] unless a child's parents or siblings have received the majority of their own primary education in English within Canada, with minor exceptions. This measure has reversed a historical trend whereby a large number of immigrant children would attend English schools. In so doing, the Charter has greatly contributed to the "visage français" (French face) of Montreal in spite of its growing immigrant population. Other provisions of the Charter have been ruled unconstitutional over the years, including those mandating French-only commercial signs, court proceedings, and debates in the legislature. Though none of these provisions are still in effect today, some continued to be on the books for a time even after courts had ruled them unconstitutional as a result of the government's decision to invoke the so-called [[Section Thirty-three of the Canadian Charter of Rights and Freedoms|notwithstanding clause]] of the Canadian constitution to override constitutional requirements. In 1993, the Charter was rewritten to allow signage in other languages so long as French was markedly "predominant." Another section of the Charter guarantees every person the right to work in French, meaning the right to have all communications with one's superiors and coworkers in French, as well as the right not to be required to know another language as a condition of hiring, unless this is warranted by the nature of one's duties, such as by reason of extensive interaction with people located outside the province or similar reasons. This section has not been as effective as had originally been hoped, and has faded somewhat from public consciousness. As of 2006, approximately 65% of the workforce on the island of Montreal predominantly used French in the workplace.
The only other province that recognizes French as an official language is [[New Brunswick]], which is officially bilingual, like the nation as a whole. Outside of [[Quebec]], the highest number of Francophones in Canada, 485,000, excluding those who claim multiple mother tongues, reside in [[Ontario]], whereas [[New Brunswick]], home to the vast majority of [[Acadians]], has the highest ''percentage'' of Francophones after [[Quebec]], 33%, or 237,000. In [[Ontario]], [[Nova Scotia]], [[Prince Edward Island]], and [[Manitoba]], French does not have full official status, although the provincial governments do provide some French-language services in all communities where significant numbers of Francophones live. Canada's three northern territories ([[Yukon]], [[Northwest Territories]], and [[Nunavut]]) all recognize French as an official language as well.
All provinces make some effort to accommodate the needs of their Francophone [[citizen]]s, although the level and quality of French-language service vary significantly from province to province. The Ontario [[French Language Services Act]], adopted in 1986, guarantees French language services in that province in regions where the Francophone population exceeds 10% of the total population, as well as communities with Francophone populations exceeding 5,000, and certain other designated areas; this has the most effect in the north and east of the province, as well as in other larger centres such as [[Ottawa]], [[Toronto]], [[Hamilton, Ontario|Hamilton]], [[Mississauga, Ontario|Mississauga]], [[London, Ontario|London]], [[Kitchener, Ontario|Kitchener]], [[St. Catharines, Ontario|St. Catharines]], [[Greater Sudbury]] and [[Windsor, Ontario|Windsor]]. However, the French Language Services Act does not confer the status of "official bilingualism" on these cities, as that designation carries with it implications which go beyond the provision of services in both languages. The City of Ottawa's language policy (by-law 2001-170) allows employees to work in their official language of choice and be supervised in the language of choice.
Canada has the status of member state in the Francophonie, while the provinces of Quebec and New Brunswick are recognized as participating governments. Ontario is currently seeking to become a full member on its own.
====Haiti====
French is an official language of [[Haiti]], although it is mostly spoken by the [[upper class]], while [[Haitian Creole]] (a [[French-based creole language]]) is more widely spoken as a [[mother tongue]].
====French overseas territories====
French is also the official language in France's overseas territories of [[French Guiana]], [[Guadeloupe]], [[Martinique]], [[Saint Barthélemy]], [[Saint Martin (France)|St. Martin]] and [[Saint-Pierre and Miquelon]].
====The United States====
Although it has no official recognition on a federal level, French is the third most-spoken language in the United States, after [[English language|English]] and [[Spanish language|Spanish]], and the second most-spoken in the states of [[Louisiana]], [[Maine]], [[Vermont]] and [[New Hampshire]]. Louisiana is home to two distinct dialects, [[Cajun French]] and [[Louisiana Creole French|Creole French]].
===Africa===
A majority of the world's French-speaking population lives in Africa. According to the 2007 report by the Organisation internationale de la Francophonie, an estimated 115 million African people spread across 31 francophone African countries can speak French either as a [[first language|first]] or [[second language]].
French is mostly a second language in Africa, but in some areas it has become a first language, such as in the region of [[Abidjan]], [[Côte d'Ivoire]] and in [[Libreville]], [[Gabon]]. It is impossible to speak of a single form of [[African French]], but rather of diverse forms of African French which have developed due to the contact with many indigenous [[African languages]].
In the territories of the [[Indian Ocean]], the French language is often spoken alongside French-derived creole languages, the major exception being [[Madagascar]]. There, a Malayo-Polynesian language ([[Malagasy]]) is spoken alongside French. The French language has also met competition with English since English has been the official language in [[Mauritius]] and the [[Seychelles]] for a long time and has recently become an official language of Madagascar.
[[Sub-Saharan Africa]] is the region where the French language is most likely to expand, owing to the expansion of education, and it is also where the language has evolved most in recent years. Some vernacular forms of French in Africa can be difficult to understand for French speakers from other countries, but written forms of the language are very closely related to those of the rest of the French-speaking world.
French is an official language of many African countries, most of them former French or [[Belgian colonial empire|Belgian colonies]]:
:*[[Benin]]
:*[[Burkina Faso]]
:*[[Burundi]]
:*[[Cameroon]]
:*[[Central African Republic]]
:*[[Chad]]
:*[[Comoros]]
:*[[Congo (Brazzaville)]]
:*[[Côte d'Ivoire]]
:*[[Democratic Republic of the Congo]]
:*[[Djibouti]]
:*[[Equatorial Guinea]] (former colony of [[Spain]])
:*[[Gabon]]
:*[[Guinea]]
:*[[Madagascar]]
:*[[Mali]]
:*[[Niger]]
:*[[Rwanda]]
:*[[Senegal]]
:*[[Seychelles]]
:*[[Togo]]
In addition, French is an administrative language and is commonly used, though not on an official basis, in [[Mauritius]] and in the [[Maghreb]] states:
:*[[Mauritania]]
:*[[Algeria]]
:*[[Morocco]]
:*[[Tunisia]]
Various reforms have been implemented in recent decades in Algeria to improve the status of [[Arabic language|Arabic]] relative to French, especially in education.
While the predominant European language in [[Egypt]] is [[English language|English]], French is considered to be a more sophisticated language by some elements of the Egyptian upper and upper-middle classes; for this reason, a typical educated Egyptian will learn French in addition to English at some point in his or her education. The perception of sophistication may be related to the use of French as the [[Noble court|royal court]] language of Egypt during the nineteenth century. Egypt participates in [[La Francophonie]].
French is also the official language of [[Mayotte]] and [[Réunion]], two [[Overseas departments and territories of France|overseas territories]] of France located in the [[Indian Ocean]], as well as an administrative and educational language in [[Mauritius]], along with [[English language|English]].
===Asia===
====Lebanon ====
French was an official language of [[Lebanon]], along with [[Arabic language|Arabic]], until 1941, when the country declared its independence from [[France]]. French is still seen as an official language by the [[Lebanese people]], as it is widely used, especially for administrative purposes, and is taught in schools as a primary language along with [[Arabic]].
====Southeast Asia====
French is an administrative language in [[Laos]] and [[Cambodia]]. French was historically spoken by the elite in the leased territory [[Guangzhouwan]] in southern [[China]]. In colonial [[Vietnam]], the elites spoke French and many who worked for the French spoke a French creole known as "[[Tây Bồi]]" (now extinct).
====India====
French has official status in the Indian [[Union Territory]] of [[Puducherry|Pondicherry]], along with the regional language [[Tamil language|Tamil]]. Some students in Tamil Nadu may opt for French as their third or fourth language (usually behind [[English language|English]], Tamil and [[Hindi]]).
French is also commonly taught as a third language in secondary schools in most cities of [[Maharashtra]] State, including [[Mumbai]], as part of the Secondary (X-SSC) and Higher Secondary School (XII-HSC) certificate examinations.
===Oceania===
French is also a second official language of the [[Pacific Island]] nation of [[Vanuatu]], along with France's territories of [[French Polynesia]], [[Wallis & Futuna]] and [[New Caledonia]].
==Dialects==
*[[Acadian French]]
*[[African French]]
*[[Aostan French]]
*[[Belgian French]]
*[[Cajun French]]
*[[Canadian French]]
*[[Cambodian French]]
*Guyana French (see [[French Guiana]])
*[[Indian French]]
*[[Jersey Legal French]]
*[[Lao French]]
*[[Levantine French]] (most commonly referred to as Lebanese French, very similar to [[Maghreb French]])
*[[Louisiana Creole French]]
*[[Maghreb French]] (see also North African French)
*[[Meridional French]]
*[[Metropolitan France|Metropolitan French]]
*[[Caldoche|New Caledonian French]]
*[[Newfoundland French]]
*Oceanic French
*[[Quebec French]]
*[[South East Asian French]]
*[[Swiss French]]
*[[Vietnamese French (dialect)|Vietnamese French]]
*West Indian French
==History==
==Sounds==
{{IPA notice}}
Although there are many French regional accents, only one version of the language is normally chosen as a model for foreign learners, which has no commonly used special name, but has been termed ''[[français neutre]]'' (neutral French).
* Voiced stops (i.e. {{IPA|/b d g/}}) are typically produced fully voiced throughout.
* Voiceless stops (i.e. {{IPA|/p t k/}}) are unaspirated.
* Nasals: The velar nasal {{IPA|/ŋ/}} occurs only in final position in borrowed (usually English) words: ''parking'', ''camping'', ''swing''. The palatal nasal {{IPA|/ɲ/}} can occur in word-initial position (e.g. ''gnon''), but it is most frequently found in intervocalic onset position or word-finally (e.g. ''montagne'').
* Fricatives: French has three pairs of homorganic fricatives distinguished by voicing, i.e. labiodental {{IPA|/f/–/v/}}, dental {{IPA|/s/–/z/}}, and palato-alveolar {{IPA|/ʃ/–/ʒ/}}. Notice that {{IPA|/s/–/z/}} are dental, like the plosives {{IPA|/t/–/d/}}, and the nasal {{IPA|/n/}}.
* French has one rhotic whose pronunciation varies considerably among speakers and phonetic contexts. In general it is described as a voiced uvular fricative, as in {{IPA|[ʁu]}} ''roue'' "wheel". Vowels are often lengthened before this segment. It can be reduced to an approximant, particularly in final position (e.g. ''fort''), or reduced to zero in some word-final positions. For other speakers, a uvular trill is also fairly common, and an apical trill {{IPA|[r]}} occurs in some dialects.
* Lateral and central approximants: The lateral approximant {{IPA|/l/}} is unvelarised in both onset (''lire'') and coda position (''il''). In the onset, the central approximants {{IPA|[w]}}, {{IPA|[ɥ]}}, and {{IPA|[j]}} each correspond to a high vowel, {{IPA|/u/}}, {{IPA|/y/}}, and {{IPA|/i/}} respectively. There are a few minimal pairs where the approximant and corresponding vowel contrast, but there are also many cases where they are in free variation. Contrasts between {{IPA|/j/}} and {{IPA|/i/}} occur in final position as in {{IPA|/pɛj/}} ''paye'' "pay" vs. {{IPA|/pɛi/}} ''pays'' "country".
French pronunciation follows strict rules based on spelling, but French spelling is often based more on history than phonology. The rules for pronunciation vary between dialects, but the standard rules are:
* final consonants: Final single consonants, in particular ''s'', ''x'', ''z'', ''t'', ''d'', ''n'' and ''m'', are normally silent. (The final letters ''c'', ''r'', ''f'' and ''l'', however, are normally pronounced.)
**When the following word begins with a vowel, though, a silent consonant ''may'' once again be pronounced, to provide a ''[[liaison (linguistics)|liaison]]'' or "link" between the two words. Some liaisons are ''mandatory'', for example the ''s'' in ''les amants'' or ''vous avez''; some are ''optional'', depending on [[dialect]] and [[register (linguistics)|register]], for example the first ''s'' in ''deux cents euros'' or ''euros irlandais''; and some are ''forbidden'', for example the ''s'' in ''beaucoup d'hommes aiment''. The ''t'' of ''et'' is never pronounced, and the silent final consonant of a noun is only pronounced in the plural and in [[set phrase]]s like ''pied-à-terre''. Note that in the case of a word ending in ''d'', as in ''pied-à-terre'', the consonant ''t'' is pronounced instead.
** Doubling a final ''n'' and adding a silent ''e'' at the end of a word (e.g. ''chien'' → ''chienne'') makes it clearly pronounced. Doubling a final ''l'' and adding a silent ''e'' (e.g. ''gentil'' → ''gentille'') adds a [j] sound.
* [[elision (French)|elision]] or vowel dropping: Some monosyllabic function words ending in ''a'' or ''e'', such as ''je'' and ''que'', drop their final vowel when placed before a word that begins with a vowel sound (thus avoiding a [[hiatus (linguistics)|hiatus]]). The missing vowel is replaced by an apostrophe (e.g. ''je ai'' is instead pronounced and spelt ''j'ai''). This gives, for example, the same pronunciation for ''l'homme qu'il a vu'' ("the man whom he saw") and ''l'homme qui l'a vu'' ("the man who saw him").
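The elision rule described above can be illustrated with a short Python sketch. It is only a rough, spelling-based approximation: the function names, the small word list and the treatment of initial ''h'' are assumptions made for the example, not a complete account of French usage. Words with an aspirated ''h'' (such as ''héros''), which block elision, have to be supplied by the caller.

```python
# Illustrative sketch of elision; names and word lists are hypothetical
# and deliberately incomplete.
ELIDABLE = {"je", "me", "te", "se", "le", "la", "ne", "de", "que"}
VOWEL_LETTERS = set("aeiouàâéèêëîïôû")


def starts_with_vowel_sound(word, aspirated_h=frozenset()):
    """Guess from spelling whether a word begins with a vowel sound."""
    w = word.lower()
    if not w:
        return False
    if w[0] == "h":
        # A mute h behaves like a vowel; an aspirated h does not.
        return w not in aspirated_h
    return w[0] in VOWEL_LETTERS


def elide(function_word, next_word, aspirated_h=frozenset()):
    """Join two words, replacing the dropped final vowel with an apostrophe."""
    if function_word.lower() in ELIDABLE and starts_with_vowel_sound(
        next_word, aspirated_h
    ):
        return function_word[:-1] + "'" + next_word
    return function_word + " " + next_word


print(elide("je", "ai"))     # j'ai
print(elide("que", "il"))    # qu'il
print(elide("je", "sais"))   # je sais
```

A rule based on spelling alone cannot tell a mute ''h'' from an aspirated one, which is why the sketch takes the aspirated words as an explicit parameter.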
==Orthography==
* [[Nasal vowel|Nasal]]: ''[[n]]'' and ''[[m]]''. When ''n'' or ''m'' follows a vowel or diphthong, the ''n'' or ''m'' becomes silent and causes the preceding vowel to become nasalized (i.e. pronounced with the soft palate extended downward so as to allow part of the air to leave through the nostrils). Exceptions are when the ''n'' or ''m'' is doubled, or immediately followed by a vowel. The prefixes ''en-'' and ''em-'' are always nasalized. The rules get more complex than this but may vary between dialects.
* [[digraph (orthography)|Digraphs]]: French does not introduce extra letters or [[diacritic]]s to specify its large range of vowel sounds and [[diphthongs]], rather it uses specific combinations of vowels, sometimes with following consonants, to show which sound is intended.
* [[Consonant length|Gemination]]: Within words, double consonants are generally not pronounced as geminates in modern French (but geminates can be heard in the cinema or TV news from as recently as the 1970s, and in very refined elocution they may still occur). For example, ''illusion'' is pronounced {{IPA|[ilyzjɔ̃]}} and not {{IPA|[illyzjɔ̃]}}. But gemination does occur between words. For example, ''une info'' ("a piece of news") is pronounced {{IPA|[ynɛ̃fo]}}, whereas ''une nympho'' ("a nympho") is pronounced {{IPA|[ynnɛ̃fo]}}.
* [[Diacritic|Accents]] are used sometimes for pronunciation, sometimes to distinguish similar words, and sometimes for etymology alone.
**Accents that affect pronunciation
***The [[acute accent]] (''l'accent aigu''), ''é'' (e.g. ''école'', "school"), means that the vowel is pronounced {{IPA|/e/}} instead of the default {{IPA|/ə/}}.
***The [[grave accent]] (''l'accent grave''), ''è'' (e.g. ''élève'', "pupil"), means that the vowel is pronounced {{IPA|/ɛ/}} instead of the default {{IPA|/ə/}}.
***The [[circumflex]] (''l'accent circonflexe''), ''ê'' (e.g. ''forêt'', "forest"), shows that an ''e'' is pronounced {{IPA|/ɛ/}} and that an ''o'' is pronounced {{IPA|/o/}}. In standard French it also signifies a pronunciation of {{IPA|/ɑ/}} for the letter ''a'', but this differentiation is disappearing. In the late 19th century, the circumflex was used in place of ''s'' where that letter was not to be pronounced. Thus, ''forest'' became ''forêt'' and ''hospital'' became ''hôpital''.
***The [[Umlaut (diacritic)|diaeresis]] (''le tréma'') (e.g. ''naïf'', "naive"; ''Noël'', "Christmas"), as in English, specifies that the vowel is pronounced separately from the preceding one rather than combined with it, and that it is not a [[schwa]].
***The [[cedilla]] (''la cédille''), ''ç'' (e.g. ''garçon'', "boy"), means that the letter ''c'' is pronounced {{IPA|/s/}} in front of the hard vowels ''a'', ''o'' and ''u'' (''c'' is otherwise {{IPA|/k/}} before a hard vowel). ''C'' is always pronounced {{IPA|/s/}} in front of the soft vowels ''e'', ''i'', and ''y'', thus ''ç'' is never found in front of soft vowels.
**Accents with no pronunciation effect
***The circumflex does not affect the pronunciation of the letters ''i'' or ''u'', and in most dialects, ''a'' as well. It usually indicates that an ''s'' came after it long ago, as in ''hôtel''.
***All other accents are used only to distinguish similar words, as in the case of distinguishing the adverbs ''là'' and ''où'' ("there", "where") from the article ''la'' and the conjunction ''ou'' ("the" fem. sing., "or") respectively.
==Grammar==
French grammar shares several notable features with most other Romance languages, including:
* the loss of Latin's [[declension]]s
* only two [[grammatical gender]]s
* the development of grammatical [[article (grammar)|article]]s from Latin [[demonstrative]]s
* new [[tense]]s formed from auxiliaries
French word order is [[Subject Verb Object]], except when the object is a pronoun, in which case the word order is [[Subject Object Verb]]. Some rare archaisms allow for different word orders.
==Vocabulary==
The majority of French words derive from [[Vulgar Latin]] or were constructed from Latin or Greek roots. There are often pairs of words, one form being "popular" (noun) and the other one "savant" (adjective), both originating from Latin. Example:
* brother: ''frère'' / ''fraternel'' < from Latin ''frater''
* finger: ''doigt'' / ''digital'' < from Latin ''digitus''
* faith: ''foi'' / ''fidèle'' < from Latin ''fides''
* cold: ''froid'' / ''frigide'' < from Latin ''frigidus''
* eye: ''œil'' / ''oculaire'' < from Latin ''oculus''
In some examples there is a common word from Vulgar Latin and a more savant word borrowed directly from [[Medieval Latin]] or even [[Ancient Greek]].
* '''Cheval'''—Concours '''équestre'''—'''Hippo'''drome
The French words which have developed from Latin are usually less recognisable than [[Italian language|Italian]] words of Latin origin because as French evolved from [[Vulgar Latin]], the unstressed final [[syllable]] of many words was dropped or elided into the following word.
It is estimated that 12% (4,200) of common French words found in a typical [[dictionary]] such as the ''Petit Larousse'' or ''Micro-Robert Plus'' (35,000 words) are of foreign origin. About 25% (1,054) of these foreign words come from [[English language|English]] and are fairly recent borrowings. The others are some 707 words from [[Italian language|Italian]], 550 from ancient [[Germanic languages]], 481 from ancient [[Gallo-Romance languages]], 215 from [[Arabic language|Arabic]], 164 from [[German language|German]], 160 from [[Celtic languages]], 159 from [[Spanish language|Spanish]], 153 from [[Dutch language|Dutch]], 112 from [[Persian language|Persian]] and [[Sanskrit language|Sanskrit]], 101 from [[Native American languages]], 89 from other [[Asian languages]], 56 from other [[Afro-Asiatic languages]], 55 from [[Slavic languages]] and [[Baltic languages]], 10 from [[Basque language|Basque]] and 144 (about three percent) from other languages.
===Numerals===
The French counting system is partially [[vigesimal]]: [[20 (number)|twenty]] (''{{lang|fr|vingt}}'') is used as a base number in the names of numbers from 60 to 99. The French word for ''eighty'', for example, is ''{{lang|fr|quatre-vingts}}'', which literally means "four twenties", and ''{{lang|fr|soixante-quinze}}'' (literally "sixty-fifteen") means 75. This reform arose after the [[French Revolution]] to unify the different counting systems (mostly vigesimal near the coast, owing to Celtic, via [[Basque language|Basque]], and Viking influence). This system is comparable to the archaic English use of ''score'', as in "fourscore and seven" (87), or "threescore and ten" (70). [[Belgian French]] and [[Swiss French]] are different in this respect. In Belgium and Switzerland 70 and 90 are ''{{lang|fr|septante}}'' and ''{{lang|fr|nonante}}''. In Switzerland, depending on the local dialect, 80 can be ''{{lang|fr|quatre-vingts}}'' (Geneva, Neuchâtel, Jura) or ''{{lang|fr|huitante}}'' (Vaud, Valais, Fribourg). ''Octante'' had been used in Switzerland in the past, but is now considered archaic. In Belgium, however, ''quatre-vingts'' is universally used.
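The way the names from 60 to 99 are built can be sketched in Python. The function name, the variant labels and the use of traditional spelling (''soixante et un'' written with spaces) are assumptions made for this illustration. The <code>"belgium"</code> variant also covers the Swiss cantons that keep ''quatre-vingts'', while <code>"swiss_huitante"</code> stands for those that use ''huitante''.

```python
# Illustrative sketch: names of 60-99 in three regional variants.
# Function and variant names are hypothetical.
UNITS = ["", "un", "deux", "trois", "quatre", "cinq", "six", "sept", "huit",
         "neuf", "dix", "onze", "douze", "treize", "quatorze", "quinze",
         "seize", "dix-sept", "dix-huit", "dix-neuf"]

BASES = {
    # France counts 60-79 from sixty and 80-99 from "four twenties".
    "france": {60: "soixante", 80: "quatre-vingt"},
    # Belgium (and Geneva, Neuchâtel, Jura) uses decimal 70 and 90.
    "belgium": {60: "soixante", 70: "septante",
                80: "quatre-vingt", 90: "nonante"},
    # Vaud, Valais and Fribourg also use a decimal word for 80.
    "swiss_huitante": {60: "soixante", 70: "septante",
                       80: "huitante", 90: "nonante"},
}


def name_60_to_99(n, variant="france"):
    """Return the French name of n (60-99) in the given regional variant."""
    if not 60 <= n <= 99:
        raise ValueError("this sketch only handles 60 to 99")
    if variant == "france":
        base_value = 60 if n < 80 else 80   # remainder runs from 0 to 19
    else:
        base_value = n - n % 10             # remainder runs from 0 to 9
    base = BASES[variant][base_value]
    rest = n - base_value
    if rest == 0:
        # A bare "four twenties" takes a plural s.
        return "quatre-vingts" if base == "quatre-vingt" else base
    if rest in (1, 11) and base != "quatre-vingt":
        return f"{base} et {UNITS[rest]}"
    return f"{base}-{UNITS[rest]}"


print(name_60_to_99(75))                    # soixante-quinze
print(name_60_to_99(80))                    # quatre-vingts
print(name_60_to_99(75, "belgium"))         # septante-cinq
print(name_60_to_99(80, "swiss_huitante"))  # huitante
```

In the standard variant the remainder runs up to nineteen, which is why 75 comes out as "sixty-fifteen", whereas the decimal variants never need a remainder above nine.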
==Writing system==
French is written using the 26 letters of the [[Latin alphabet]], plus five diacritics (the [[circumflex]] accent, [[acute accent]], [[grave accent]], [[Umlaut (diacritic)|diaeresis]], and [[cedilla]]) and the two [[Ligature (typography)|ligatures]] (œ) and (æ).
French spelling, like English spelling, tends to preserve obsolete pronunciation rules. This is mainly due to extreme phonetic changes since the Old French period, without a corresponding change in spelling. Moreover, some conscious changes were made to restore Latin orthography:
* Old French ''doit'' > French ''doigt'' "finger" (Latin ''digitus'')
* Old French ''pie'' > French ''pied'' "foot" (Latin ''pes'', stem ''ped-'')
As a result, it is difficult to predict the spelling on the basis of the sound alone. Final consonants are generally silent, except when the following word begins with a vowel. For example, all of these words end in a vowel sound: ''pied'', ''aller'', ''les'', ''finit'', ''beaux''. The same words followed by a vowel, however, may sound the consonants, as they do in these examples: ''beaux-arts'', ''les amis'', ''pied-à-terre''.
On the other hand, a given spelling will almost always lead to a predictable sound, and the [[Académie française]] works hard to enforce and update this correspondence. In particular, a given vowel combination or diacritic predictably leads to one phoneme.
The diacritics have '''phonetic''', '''semantic''', and '''etymological''' significance.
* [[acute accent]] (''é''): Over an ''e'', indicates the sound of a short ''ai'' in English, with no [[diphthong]]. An ''é'' in modern French is often used where a combination of ''e'' and a consonant, usually ''s,'' would have been used formerly: ''écouter'' < ''escouter''. This type of accent mark is called ''accent aigu'' in French.
* [[grave accent]] (''à'', ''è'', ''ù''): Over ''a'' or ''u'', used only to distinguish homophones: ''à'' ("to") vs. ''a'' ("has"), ''ou'' ("or") vs. ''où'' ("where"). Over an ''e'', indicates the sound {{IPA|/ɛ/}}.
* [[circumflex]] (''â'', ''ê'', ''î'', ''ô'', ''û''): Over an ''a'', ''e'' or ''o'', indicates the sound {{IPA|/ɑ/}}, {{IPA|/ɛ/}} or {{IPA|/o/}}, respectively (the distinction ''a'' {{IPA|/a/}} vs. ''â'' {{IPA|/ɑ/}} tends to disappear in many dialects). Most often indicates the historical deletion of an adjacent letter (usually an ''s'' or a vowel): ''château'' < ''castel'', ''fête'' < ''feste'', ''sûr'' < ''seur'', ''dîner'' < ''disner''. It has also come to be used to distinguish homophones: ''du'' ("of the") vs. ''dû'' (past participle of ''devoir'' "to have to do something (pertaining to an act)"; note that ''dû'' is in fact written thus because of a dropped ''e'': ''deu''). (''See [[Use of the circumflex in French]]'')
* [[Umlaut (diacritic)|diaeresis]] or ''tréma'' (''ë'', ''ï'', ''ü'', ''ÿ''): Indicates that a vowel is to be pronounced separately from the preceding one: ''naïve'', ''Noël''. A diaeresis on ''y'' only occurs in some proper names and in modern editions of old French texts. Some proper names in which ''ÿ'' appears include ''Aÿ'' (commune in ''canton de la Marne'' formerly ''Aÿ-Champagne''), ''Rue des Cloÿs'' (alley in the 18th arrondissement of Paris), ''Croÿ'' (family name and hotel on the Boulevard Raspail, Paris), ''[[Château du Feÿ]]'' (near Joigny), ''Ghÿs'' (name of Flemish origin spelt ''Ghijs'' where ''ij'' in handwriting looked like ''ÿ'' to French clerks), ''l'Haÿ-les-Roses'' (commune between Paris and Orly airport), ''Pierre Louÿs'' (author), ''Moÿ'' (place in ''commune de l'Aisne'' and family name), and ''Le Blanc de Nicolaÿ'' (an insurance company in eastern France). The diaeresis on ''u'' appears only in the biblical proper names ''Archélaüs'', ''Capharnaüm'', ''Emmaüs'', ''Ésaü'' and ''Saül''. Nevertheless, since the 1990 orthographic rectifications (which are not applied at all by most French people), the diaeresis in words containing ''guë'' (such as ''aiguë'' or ''ciguë'') may be moved onto the ''u'': ''aigüe'', ''cigüe''. Words coming from German retain the old Umlaut (''ä'', ''ö'' and ''ü'') if applicable but use French pronunciation, such as ''kärcher'' (trade mark of a pressure washer).
* [[cedilla]] (''ç''): Indicates that an etymological ''c'' is pronounced {{IPA|/s/}} when it would otherwise be pronounced /k/. Thus ''je lance'' "I throw" (with ''c'' = {{IPA|[s]}} before ''e''), ''je lan'''ç'''ais'' "I was throwing" (''c'' would be pronounced {{IPA|[k]}} before ''a'' without the cedilla). The c cedilla (ç) softens the hard /k/ sound to /s/ before the vowels '''a''', '''o''' or '''u''', for example '''ça''' /sa/. C cedilla is never used before the vowels '''e''' or '''i''' since these two vowels always produce a soft /s/ sound ('''ce''', '''ci''').
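The cedilla rule described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the helper name <code>c_sound</code> is hypothetical, only the five plain vowels named in the text are handled, and digraphs such as ''ch'' are out of scope.

```python
# Simplified sketch of the "c" / "ç" rule described in the text.
# Assumption: the letter is immediately followed by one of a, e, i, o, u.
SOFT_VOWELS = {"e", "i"}        # plain "c" is pronounced /s/ before these
HARD_VOWELS = {"a", "o", "u"}   # plain "c" is pronounced /k/ before these


def c_sound(letter: str, next_vowel: str) -> str:
    """Return "s" or "k" for "c" or "ç" followed by a vowel (hypothetical helper)."""
    if letter not in ("c", "ç"):
        raise ValueError("expected 'c' or 'ç'")
    if next_vowel not in SOFT_VOWELS | HARD_VOWELS:
        raise ValueError("expected one of a, e, i, o, u")
    if letter == "ç":
        # The cedilla forces the soft sound; per the text it is only
        # written before a, o or u, where "c" would otherwise be hard.
        return "s"
    return "s" if next_vowel in SOFT_VOWELS else "k"
```

For ''lance'' the call <code>c_sound("c", "e")</code> gives <code>"s"</code>, and for ''lançais'' the call <code>c_sound("ç", "a")</code> also gives <code>"s"</code>, whereas an unmarked ''c'' before ''a'' gives <code>"k"</code>.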
There are two [[ligatures]], which have various origins.
* The ligature ''[[œ]]'' is a mandatory contraction of ''oe'' in certain words. Some of these are native French words, with the pronunciation {{IPA|/œ/}} or {{IPA|/ø/}}, e.g. ''sœur'' "sister" {{IPA|/sœʁ/}}, ''œuvre'' "work (of art)" {{IPA|/œvʁ/}}. Note that it usually appears in the combination ''œu''; ''œil'' is an exception. Many of these words were originally written with the [[Digraph (orthography)|digraph]] ''eu''; the ''o'' in the ligature represents a sometimes artificial attempt to imitate the Latin spelling: Latin ''bovem'' > Old French ''buef''/''beuf'' > Modern French ''bœuf''. ''Œ'' is also used in words of Greek origin, as the Latin rendering of the Greek diphthong ''οι'', e.g. ''cœlacanthe'' "coelacanth". These words used to be pronounced with the vowel {{IPA|/e/}}, but in recent years a spelling pronunciation with {{IPA|/ø/}} has taken hold, e.g. ''œsophage'' {{IPA|/ezɔfaʒ/}} or {{IPA|/øzɔfaʒ/}}. The pronunciation with {{IPA|/e/}} is often seen to be more correct. The ligature œ is not used in some occurrences of the letter combination ''oe'', for example, when ''o'' is part of a prefix (''coexister'').
* The ligature ''[[æ]]'' is rare and appears in some words of Latin and Greek origin like ''ægosome'', ''ægyrine'', ''æschne'', ''cæcum'', ''nævus'' or ''uræus''. The vowel quality is identical to é {{IPA|/e/}}.
French writing, as with any language, is affected by the spoken language. In Old French, the plural for ''animal'' was ''animals''. Common speakers pronounced a ''u'' before a word ending in ''l'' as the plural. This resulted in ''animauls''. As the French language evolved this vanished and the form ''animaux'' (''aux'' pronounced {{IPA|/o/}}) was admitted. The same is true for ''cheval'' pluralized as ''chevaux'' and many others. Also ''castel'' pl. ''castels'' became ''château'' pl. ''châteaux''.
==Samples==
German language
The '''German language''' ({{lang|de|''Deutsch''}}) is a [[West Germanic languages|West Germanic language]] and one of the world's [[world language|major languages]]. German is closely related to and classified alongside [[English language|English]] and [[Dutch language|Dutch]]. Around the world, German is spoken by approximately 100 million [[First language|native speakers]] and also about 80 million non-native speakers, and [[Standard German]] is widely taught in schools, universities, and [[Goethe Institute]]s worldwide.
==Geographic distribution==
===Europe===
German is spoken primarily in [[Languages of Germany|Germany]] (95%), [[Languages of Austria|Austria]] (89%) and [[Linguistic geography of Switzerland|Switzerland]] (64%); together with [[Liechtenstein]] and [[Luxembourg]], these constitute the countries where German is the majority language ([[D-A-CH-Li-Lux]]).
Other European German-speaking communities are found in [[Italy]] ([[Province of Bolzano-Bozen|Bolzano-Bozen]]), in the [[German speaking community in Belgium|East Cantons]] of [[Belgium]], in the [[France|French]] region of [[Alsace]], which has repeatedly changed hands between Germany and France, and in some border villages of the former [[South Jutland County]] (in German, ''Nordschleswig'', in Danish, ''Sønderjylland'') of [[Denmark]].
Some German-speaking communities still survive in parts of [[Romania]], the [[Czech Republic]], [[Poland]], [[Hungary]], and above all [[Russia]] and [[Kazakhstan]], although forced expulsions after World War II and massive emigration to Germany in the 1980s and 1990s have depopulated most of these communities. It is also spoken by German-speaking foreign populations and some of their descendants in [[Portugal]], [[Spain]], Italy, [[Morocco]], [[Egypt]], [[Israel]], [[Cyprus]], [[Turkey]], [[Greece]], [[United Kingdom]], [[Netherlands]], [[Scandinavia]], [[Siberia]] in Russia, Hungary, Romania, [[Bulgaria]], and the former [[Yugoslavia]] ([[Bosnia and Herzegovina|Bosnia]], [[Serbia]], [[Republic of Macedonia|Macedonia]], [[Croatia]] and [[Slovenia]]).
In Luxembourg and the surrounding areas, large parts of the native population speak German dialects, and some people also have a command of standard German (especially in Luxembourg). In the [[France|French]] regions of [[Alsace]] (German: ''Elsass'') and [[Lorraine (region)|Lorraine]] (German: ''Lothringen''), however, [[French language|French]] has replaced the local German dialects as the official language, even though the dialects have not been fully replaced on the street.
===Overseas===
Outside of Europe and the former [[Soviet Union]], the largest German-speaking communities are to be found in the [[United States]], [[Canada]], [[Brazil]] and [[Argentina]], to which millions of Germans migrated in the last 200 years; the vast majority of their descendants, however, no longer speak German. Additionally, German-speaking communities can be found in the former [[List of former German colonies|German colony]] of [[Namibia]] (independent from [[South Africa]] since 1990), as well as in other countries of German emigration such as [[Mexico]], [[Dominican Republic]], [[Paraguay]], [[Uruguay]], [[Chile]], [[Peru]], [[Venezuela]] (where [[Alemán Coloniero]] developed), South Africa and [[Australia]].
====South America====
In Brazil the largest concentrations of German speakers are in [[Rio Grande do Sul]] (where [[Riograndenser Hunsrückisch]] developed), [[Santa Catarina (state)|Santa Catarina]], [[Paraná (state)|Paraná]], and [[Espírito Santo]]; there are also large communities of German-speaking descendants in Argentina, Uruguay and Chile. In the 20th century, over 100,000 German [[Refugee|political refugees]] and invited entrepreneurs settled in [[Latin America]], in countries such as [[Costa Rica]], [[Panama]], Venezuela and the Dominican Republic, where they established German-speaking enclaves; there has reportedly also been a small [[German immigration to Puerto Rico]].
====North America====
The United States has the largest concentration of German speakers outside of Europe; an indication of this presence can be found in the names of such villages and towns as [[New Leipzig, North Dakota|New Leipzig]], [[Munich, North Dakota|Munich]], [[Karlsruhe, North Dakota|Karlsruhe]], and [[Strasburg, North Dakota|Strasburg]], [[North Dakota]], and [[New Braunfels]], Texas. Though over the course of the 20th century many of the descendants of 18th and 19th-century immigrants ceased speaking German at home, small populations of elderly (as well as some younger) speakers can be found in [[Pennsylvania]] ([[Amish]], [[Hutterites]], [[Dunkards]] and some [[Mennonites]] historically spoke [[Pennsylvania German language|Pennsylvania Dutch]] (a [[West Central German]] variety) and [[Hutterite German]]), [[Kansas]] (Mennonites and [[Volga German]]s), North Dakota (Hutterite Germans, Mennonites, [[History of Germans in Russia and the Soviet Union|Russian German]]s, Volga Germans, and [[Baltic Germans]]), [[South Dakota]], [[Montana]], [[Texas]] ([[Texas German]]), [[Wisconsin]], [[Indiana]], [[Louisiana]] and [[Oklahoma]]. Early twentieth century immigration was often to [[St. Louis, Missouri|St. Louis]], [[Chicago]], [[New York]], [[Pittsburgh]] and [[Cincinnati]]. Most of the post–[[World War II]] wave are in the New York, [[Philadelphia]], [[Los Angeles]], [[San Francisco]] and Chicago [[urban area]]s, and in [[Florida]], [[Arizona]] and [[California]] where large communities of retired German, Swiss and Austrian expatriates live. The [[German Americans|American population of German ancestry]] is above 60 million. The German language is the third largest language in the U.S. after [[Spanish language|Spanish]].
In Canada there are people of German ancestry throughout the country and especially in the western cities such as [[Kelowna]]. German is also spoken in [[Ontario]] and southern [[Nova Scotia]]. There is a large and vibrant community in the city of [[Kitchener, Ontario]]. German immigrants were instrumental in the country's three largest urban areas: [[Montreal]], [[Toronto]] and [[Vancouver]], but post-WWII immigrants managed to preserve a fluency in the German language in their respective neighborhoods and sections. In the first half of the 20th century, over a million [[German-Canadian]]s made the language one of Canada's most spoken after [[French language|French]].
In Mexico there are also large populations of German ancestry, mainly in the cities of [[Mexico City]], [[Puebla]], [[Mazatlán]] and [[Tapachula]], with larger populations scattered in the states of [[Chihuahua]], [[Durango]], and [[Zacatecas]]. German ancestry is also said to be found in towns around [[Guadalajara, Jalisco]] and in much of Northern Mexico, where German influence was absorbed into Mexican culture. Standard German is spoken by the affluent German communities in Puebla, Mexico City, [[Nuevo Leon]], [[San Luis Potosi]] and [[Quintana Roo]]. German immigration in the twentieth century was small, but produced German-speaking communities in Central America (namely [[Guatemala]], [[Honduras]] and [[Nicaragua]]) and on Caribbean islands such as the [[Dominican Republic]].
'''Dialects in North America:'''
The dialects of German which are or were primarily spoken in colonies or communities founded by German speaking people resemble the dialects of the regions the founders came from. For example, Pennsylvania German resembles dialects of the [[Rhenish Palatinate|Palatinate]], and Hutterite German resembles dialects of [[Carinthia (state)|Carinthia]]. [[Texas German]] is a dialect spoken in the areas of Texas settled by the [[Adelsverein]], such as New Braunfels and Fredericksburg. In the [[Amana Colonies]] in the state of Iowa [[Amana German]] is spoken. [[Plautdietsch]] is a large [[minority language]] spoken in Northern Mexico by the [[Mennonite]] communities, and is spoken by more than 200,000 people in Mexico. [[Hutterite German]] is an Upper German dialect of the [[Austro-Bavarian]] variety of the German language, which is spoken by Hutterite communities in Canada and the United States. Hutterite is spoken in the U.S. states of [[Washington]], Montana, North Dakota and South Dakota, and [[Minnesota]]; and in the Canadian provinces of [[Alberta]], [[Saskatchewan]] and [[Manitoba]]. Its speakers belong to some Schmiedleit, Lehrerleit, and Dariusleit Hutterite groups, but there are also speakers among the older generations of Prairieleit (the descendants of those Hutterites who chose not to settle in colonies). Hutterite children who grow up in the colonies learn and speak first Hutterite German before learning English in the public school, the standard language of the surrounding areas. Many colonies though continue with German Grammar School, separate from the public school, throughout a student's elementary education.
====Creoles====
There is an important German creole being studied and recovered, named [[Unserdeutsch]], spoken in the former German colony of [[Papua New Guinea]], across [[Micronesia]] and in northern Australia (i.e. coastal parts of [[Queensland]] and [[Western Australia]]), by few elderly people. The risk of its extinction is serious and efforts to revive interest in the language are being implemented by scholars.
====Internet====
According to [[Global Reach]] (2004), 6.9% of the Internet population is German. According to [[Netz-tipp]] (2002), 7.7% of webpages are written in German, making it second only to English in the European language group. They also report that 12% of Google's users use its German interface.
Older statistics: Babel (1998) found somewhat similar demographics. FUNREDES (1998) and Vilaweb (2000) both found that German is the third most popular language used by websites, after English and Japanese.
==History==
The history of the language begins with the [[High German consonant shift]] during the [[migration period]], separating [[High German]] dialects from common [[West Germanic]]. The earliest testimonies of [[Old High German]] are from scattered [[Elder Futhark]] inscriptions, especially in [[Alemannic]], from the 6th century, the earliest glosses (''[[Abrogans]]'') date to the 8th and the oldest coherent texts (the ''[[Hildebrandslied]]'', the ''[[Muspilli]]'' and the [[Merseburg Incantations]]) to the 9th century. [[Old Saxon]] at this time belongs to the [[Ingvaeonic|North Sea Germanic]] cultural sphere, and [[Low Saxon]] should fall under German rather than [[Anglo-Frisian]] influence during the [[Holy Roman Empire]].
As Germany was divided into many different [[state]]s, the only force working for a unification or [[standard language|standardization]] of German during a period of several hundred years was the general preference of writers trying to write in a way that could be understood in the largest possible area.
When [[Martin Luther]] translated the [[Bible]] (the [[New Testament]] in 1522 and the [[Old Testament]], published in parts and completed in 1534) he based his translation mainly on the bureaucratic standard language used in Saxony (''sächsische Kanzleisprache'') also known as ''Meißner-Deutsch'' (Meißner-German), which was the most widely understood language at this time, because the region it was spoken in was quite influential amongst the German states. This language was based on Eastern Upper and Eastern Central German dialects and preserved much of the grammatical system of Middle High German (unlike the spoken German dialects in Central and Upper Germany that already at that time began to lose the [[genitive case]] and the preterite tense). In the beginning, copies of the Bible had a long list for each region, which translated words unknown in the region into the regional dialect. [[Roman Catholics]] rejected Luther's translation in the beginning and tried to create their own Catholic standard (''gemeines Deutsch'') — which, however, only differed from 'Protestant German' in some minor details. It took until the middle of the 18th century to create a standard that was widely accepted, thus ending the period of [[Early New High German]]. In 1901 the 2nd Orthographical Conference ended with a complete standardization of German language in written form while the ''Deutsche Bühnensprache'' (literally: ''German stage-language'') had already established spelling-rules for German three years earlier which were later to become obligatory for general German pronunciation.
German used to be the language of commerce and government in the [[Habsburg Empire]], which encompassed a large area of Central and Eastern Europe. Until the mid-19th century it was essentially the language of townspeople throughout most of the Empire. Its use indicated that the speaker was a [[merchant]] or an urbanite, not the speaker's nationality. Some cities, such as [[Prague]] (German: ''Prag'') and [[Budapest]] ([[Buda]], German: ''Ofen''), were gradually [[Germanization|Germanized]] in the years after their incorporation into the Habsburg domain. Others, such as [[Bratislava]] (German: ''Pressburg''), were originally settled during the Habsburg period and were primarily German at that time. A few cities such as [[Milan]] (German: ''Mailand'') remained primarily non-German. However, most cities were primarily German during this time, such as Prague, Budapest, Bratislava (German: ''Pressburg''), [[Zagreb]] (German: ''Agram''), and [[Ljubljana]] (German: ''Laibach''), though they were surrounded by territory that spoke other languages.
Until about 1800, standard German was almost only a written language. At this time, people in urban [[northern Germany]], who spoke dialects very different from Standard German, learned it almost like a foreign language and tried to pronounce it as close to the spelling as possible. Prescriptive pronunciation guides used to consider northern [[German phonology|German pronunciation]] to be the standard. However, the actual pronunciation of standard German varies from region to region.
Media and written works are almost all produced in standard German (often called ''Hochdeutsch'' in German) which is understood in all areas where German is spoken, except by [[Nursery school|pre-school]] children in areas which speak only dialect, for example [[Switzerland]] and [[Austria]]. However, in this age of television, even they now usually learn to understand Standard German before school age.
The first dictionary of the [[Brothers Grimm]], the 16 parts of which were issued between 1852 and 1860, remains the most comprehensive guide to the words of the German language. In 1860, grammatical and orthographic rules first appeared in the ''[[Duden Handbook]]''. In 1901, this was declared the standard definition of the German language. Official revisions of some of these rules were not issued until 1998, when the [[German spelling reform of 1996]] was officially promulgated by governmental representatives of all German-speaking countries. Since the reform, German spelling has been in an eight-year transitional period where the reformed spelling is taught in most schools, while traditional and reformed spellings co-exist in the media. See [[German spelling reform of 1996]] for an overview of the public debate concerning the reform with some major newspapers and magazines and several known writers refusing to adopt it.
The German spelling reform of 1996 led to public controversy and considerable dispute. The parliaments of some states (''Bundesländer''), namely [[North Rhine-Westphalia|North Rhine Westphalia]] and Bavaria, would not accept it. The dispute at one point reached the highest court, which dealt with it briefly, ruling that the states had to decide for themselves and that the reform could be made the official rule only in schools; everybody else could continue writing as they had learned. After 10 years, without any intervention by the federal parliament, a major yet incomplete revision was introduced in 2006, just in time for the new school year. In 2007, some venerable spellings will finally be invalidated even though they caused little or no trouble. The only sure and easily recognizable sign that a text complies with the reform is the -ss at the end of words, as in ''dass'' and ''muss''. Classic spelling forbade this ending, instead using ''daß'' and ''muß''.
The controversy revolved around the question of whether a language is part of the culture, which must be preserved, or a means of communicating information, which has to allow for growth. (The reformers seemed to be unimpressed by the fact that a considerable part of that culture, namely the entire German literature of the 20th century, is in the old spelling.)
The increasing use of English in Germany's higher education system, as well as in business and in popular culture, has led various German academics to state, not necessarily from an entirely negative perspective, that German is a language in decline in its native country. For example, Ursula Kimpel, of the [[University of Tübingen]], said in 2005 that “German universities are offering more courses in English because of the large number of students coming from abroad. German is unfortunately a language in decline. We need and want our professors to be able to teach effectively in English.”
==Standard German==
Standard German originated not as a traditional dialect of a specific region, but as a [[written language]]. However, there are places where the traditional regional dialects have been replaced by standard German; this is the case in vast stretches of Northern Germany, but also in major cities in other parts of the country.
Standard German differs regionally, between German-speaking countries, in [[vocabulary]] and some instances of [[pronunciation]], and even [[grammar]] and [[orthography]]. This variation must not be confused with the variation of local dialects. Even though the regional varieties of standard German are only to a certain degree influenced by the local dialects, they are very distinct. German is thus considered a pluricentric language.
In most regions, the speakers use a continuum of mixtures from more dialectal varieties to more standard varieties according to situation.
In the German-speaking parts of Switzerland, mixtures of dialect and standard are very seldom used, and the use of standard German is largely restricted to the written language. Therefore, this situation has been called a ''medial [[diglossia]]''. [[Swiss Standard German]] is used in the Swiss education system.
===Official status===
Standard German is the only [[official language]] in Liechtenstein and Austria; it shares official status in [[Germany]] (with [[Danish language|Danish]], [[Frisian languages|Frisian]] and [[Sorbian languages|Sorbian]] as minority languages), Switzerland (with [[French language|French]], [[Italian language|Italian]] and [[Romansh language|Romansh]]), Belgium (with [[Dutch language|Dutch]] and French) and Luxembourg (with French and [[Luxembourgish language|Luxembourgish]]). It is used as a local official language in Italy ([[Province of Bolzano-Bozen]]), as well as in the cities of [[Sopron]] (Hungary), Krahule ([[Slovakia]]) and several cities in Romania. It is the official language (with Italian) of the [[Vatican City|Vatican]] [[Swiss Guard]].
German has an officially recognized status as regional or auxiliary language in Denmark ([[South Jutland]] region), France (Alsace and [[Moselle]] regions), Italy (Gressoney valley), Namibia, [[Poland]] ([[Bilingual communes in Poland|Opole]] region), and Russia (Asowo and Halbstadt).
German is one of the 23 official [[languages of the European Union]]. It is the language with the largest number of native speakers in the [[European Union]] and the second-most spoken language in Europe, narrowly behind English and well ahead of French.
===German as a foreign language===
German is the third most taught [[foreign language]] in the English speaking world after French and Spanish.
German is the main language of about 90–95 million people in Europe (as of 2004), or 13.3% of all Europeans, being the second most spoken native language in Europe after [[Russian language|Russian]], above French (66.5 million speakers in 2004) and English (64.2 million speakers in 2004). It is therefore the most spoken first language in the EU. It is the second most known foreign language in the EU. It is one of the official languages of the European Union, and one of the three [[working language]]s of [[European Commission|the European Commission]], along with English and French. Thirty-two percent of citizens of the EU-15 countries say they can converse in German (either as a mother tongue or as a second or foreign language). This is assisted by the widespread availability of German TV by cable or satellite.
German was once, and still remains to some extent, a [[lingua franca]] in Central, Eastern and [[Northern Europe]].
==Dialects==
German is a member of the [[West Germanic language|western branch]] of the [[Germanic languages|Germanic]] [[Language family|family of languages]], which in turn is part of the [[Indo-European language family]]. The German dialect continuum is traditionally divided most broadly into [[High German languages|High German]] and Low German.
The variation among the German dialects is considerable, with only the neighbouring dialects being mutually intelligible. Some dialects are not intelligible to people who only know standard German. However, all German dialects belong to the dialect continuum of High German and Low Saxon languages. Until roughly the end of the Second World War, there was a dialect continuum of all the continental West Germanic languages because nearly any pair of neighbouring dialects were perfectly mutually intelligible.
=== Low German ===
Low Saxon varieties (spoken on German territory) are considered linguistically a language separate from the German language by some, but just a dialect by others. Sometimes, Low Saxon and [[Low Franconian]] are grouped together because both are unaffected by the High German consonant shift. However, the part of the population capable of speaking and responding to it, or of understanding it has decreased continuously since WWII. Currently the effort to maintain a residual presence in cultural life is negligible. [[Middle Low German]] was the [[lingua franca]] of the [[Hanseatic League]]. It was the predominant language in Northern Germany. This changed in the 16th century. In 1534 the [[Luther Bible]] by Martin Luther was printed. This translation is considered to be an important step towards the evolution of the Early New High German. It aimed to be understandable to an ample audience and was based mainly on Central and [[Upper German]] varieties. The Early New High German language gained more prestige than Low Saxon and became the language of science and literature. Other factors were that around the same time, the Hanseatic league lost its importance as new trade routes to [[Asia]] and the [[Americas]] were established, and that the most powerful German states of that period were located in Middle and Southern Germany.
The 18th and 19th centuries were marked by mass [[education]], the language of the schools being standard German. Low Saxon was slowly pushed further and further back until it was nothing but a language spoken by the uneducated and at home. Today Low Saxon can be divided into two groups: Low Saxon varieties with a reasonable standard German influx, and varieties of Standard German with a Low Saxon influence, known as [[Missingsch]].
=== High German ===
High German is divided into [[Central German]] and [[Upper German language|Upper German]]. Central German dialects include [[Ripuarian]], [[Moselle Franconian]], [[Hessian language|Hessian]], [[Thuringian]], [[South Franconian]], [[Lorraine Franconian]] and [[Upper Saxon dialect|Upper Saxon]]. It is spoken in the southeastern Netherlands, eastern Belgium, Luxembourg, parts of France, and in Germany approximately between the River [[Main]] and the southern edge of the Lowlands. Modern Standard German is mostly based on Central German, but it should be noted that the common (but not linguistically correct) German term for modern Standard German is ''Hochdeutsch'', that is, ''High German''.
The Moselle Franconian varieties spoken in Luxembourg have been officially standardised and institutionalised and are therefore usually considered a separate language known as [[Luxembourgish language|Luxembourgish]].
Upper German dialects include [[Alemannic German|Alemannic]] (for instance [[Swiss German (linguistics)|Swiss German]]), [[Swabian German|Swabian]], [[East Franconian German|East Franconian]], [[Alsatian]] and [[Austro-Bavarian]]. They are spoken in parts of the Alsace, southern Germany, Liechtenstein, Austria, and in the German-speaking parts of Switzerland and Italy. [[Wymysorys]], [[Sathmarisch]] and [[Siebenbürgisch]] are High German dialects of Poland and Romania respectively. The High German varieties spoken by [[Ashkenazi Jew]]s (mostly in the former [[Soviet Union]]) have several unique features, and are usually considered as a separate language, [[Yiddish]]. It is the only Germanic language that does not use the [[Latin alphabet]] as its [[official script|standard script]].
===German dialects versus varieties of standard German===
In German [[linguistics]], German [[dialect]]s are distinguished from [[variety (linguistics)|varieties]] of [[standard German]].
*The ''German dialects'' are the traditional local varieties. They are traditionally traced back to the different German tribes. Many of them are hardly understandable to someone who knows only standard German, since they often differ from standard German in [[lexicon]], [[phonology]] and [[syntax]]. If a narrow definition of [[language]] based on [[mutual intelligibility]] is used, many German dialects are considered to be separate languages (for instance in the [[Ethnologue]]). However, such a point of view is unusual in German linguistics.
*The ''varieties of standard German'' refer to the different local varieties of the [[pluricentric language|pluricentric]] standard German. They only differ slightly in lexicon and phonology. In certain regions, they have replaced the traditional German dialects, especially in Northern Germany.
==Grammar==
German is an [[Fusional language|inflected language]].
===Noun inflection===
[[German nouns]] inflect into:
* one of four [[Grammatical case|case]]s: [[nominative]], [[genitive]], [[dative case|dative]], and [[accusative case|accusative]].
* one of three [[grammatical gender|genders]]: masculine, feminine, or neuter. Word endings sometimes reveal grammatical gender; for instance, nouns ending in '''...ung''' ([[-ing]]), '''...e''', '''...schaft''' ([[-ship]]), '''...keit''' or '''...heit''' ([[-hood]]) are feminine, while nouns ending in '''...chen''' or '''...lein''' ([[diminutive]] forms) are neuter and nouns ending in '''...ismus''' ([[-ism]]) are masculine. The gender of other nouns is disputed and sometimes depends on the region in which the language is spoken. Additionally, ambiguous endings exist, such as '''...er''' ([[-er]]), e.g. ''Feier'' (feminine), English ''celebration, party'', and ''Arbeiter'' (masculine), English ''labourer''. Sentences can usually be reorganized to avoid a misunderstanding.
* two numbers: singular and plural
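The ending-based gender cues listed above can be turned into a small lookup. The following Python sketch is only a heuristic: the function name <code>guess_gender</code> is hypothetical, the endings ''-e'' and ''-er'' are deliberately left out because the text describes them as unreliable or ambiguous, and a matching ending only ''sometimes'' reveals the gender, so the result should be treated as a guess.

```python
# Heuristic sketch: guess grammatical gender from the noun endings named in the text.
# Assumption: "-e" and "-er" are omitted because they are unreliable or ambiguous.
SUFFIX_GENDER = [
    ("schaft", "feminine"),
    ("ismus", "masculine"),
    ("chen", "neuter"),
    ("lein", "neuter"),
    ("keit", "feminine"),
    ("heit", "feminine"),
    ("ung", "feminine"),
]


def guess_gender(noun: str):
    """Return a guessed gender, or None if no listed ending matches."""
    lowered = noun.lower()
    for suffix, gender in SUFFIX_GENDER:
        if lowered.endswith(suffix):
            return gender
    return None  # no cue available, e.g. for "Feier" or "Arbeiter"
```

With this table, <code>guess_gender("Freundschaft")</code> returns <code>"feminine"</code>, while <code>guess_gender("Arbeiter")</code> returns <code>None</code> because ''-er'' gives no reliable cue.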
Although German is usually cited as an outstanding example of a highly inflected language, the degree of inflection is considerably less than in [[Old German]], or in other old [[Indo-European languages]] such as [[Latin]], [[Ancient Greek]], or [[Sanskrit]]. The three genders have collapsed in the plural, which now behaves, grammatically, somewhat as a fourth gender. With four cases and three genders plus plural there are 16 distinct possible combinations of case and gender/number, but presently there are only six forms of the [[Article (grammar)|definite article]] used for the 16 possibilities. Inflection for case on the noun itself is required in the singular for strong masculine and neuter nouns in the genitive and sometimes in the dative. Both of these cases are losing ground to substitutes in [[Natural language|informal speech]]. The dative ending is considered somewhat old-fashioned in many contexts and often dropped, but it is still used in sayings and in formal speech or in written language. Weak masculine nouns share a common case ending for genitive, dative and accusative in the singular. Feminines are not declined in the singular. The plural does have an inflection for the dative. In total, seven inflectional endings (not counting plural markers) exist in German: ''-s, -es, -n, -ns, -en, -ens, -e''.
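The count of 16 combinations and six article forms can be checked with a short Python sketch. The table below is the standard definite-article paradigm, which is not spelled out in the text above; the variable names are illustrative only.

```python
# Definite articles by case; columns are masculine, feminine, neuter, plural.
# Assumption: this standard paradigm is supplied here for illustration.
COLUMNS = ["masculine", "feminine", "neuter", "plural"]
ARTICLES = {
    "nominative": ["der", "die", "das", "die"],
    "accusative": ["den", "die", "das", "die"],
    "dative":     ["dem", "der", "dem", "den"],
    "genitive":   ["des", "der", "des", "der"],
}

# Every (case, gender/number) pair is one of the 16 possibilities.
combinations = [(case, column) for case in ARTICLES for column in COLUMNS]

# Collapsing identical spellings leaves the distinct article forms.
distinct_forms = sorted({form for row in ARTICLES.values() for form in row})

print(len(combinations))   # 16
print(distinct_forms)      # ['das', 'dem', 'den', 'der', 'des', 'die']
```

The same spelling therefore serves several slots: ''der'', for example, appears as nominative masculine, dative and genitive feminine, and genitive plural.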
In the German orthography, nouns and most words with the syntactical function of nouns are capitalised, which is supposed to make it easier for readers to find out what function a word has within the sentence (''Am Freitag bin ich einkaufen gegangen.'' — "On Friday I went shopping."; ''Eines Tages war er endlich da.'' — "One day he finally showed up".) This spelling convention is almost unique to German today (shared perhaps only by the closely related [[Luxemburgish language]]), although it was historically common in other languages (e.g., Danish and English), too.
Like most Germanic languages, German forms left-branching noun [[compound (linguistics)|compound]]s, where the first noun modifies the category given by the second, for example: ''Hundehütte'' (English ''dog hut''; specifically: ''doghouse''). Unlike English, where newer compounds or combinations of longer nouns are often written in ''open'' form with separating spaces, German (like the other Germanic languages) nearly always uses the ''closed'' form without spaces, for example: ''Baumhaus'' (English ''tree house''). Like English, German allows arbitrarily long compounds, but these are rare. (''See also'' [[English compounds]].)
The longest German word verified to be actually in (albeit very limited) use is [[Rinderkennzeichnungs- und Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz|Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz]].
[which, literally translated, breaks up into: Rind (cattle) - Fleisch (meat) - Etikettierung(s) (labelling) - Überwachung(s) (supervision) - Aufgaben (duties) - Übertragung(s) (assignment) - Gesetz (law), so "Beef labelling supervision duty assignment law".]
===Verb inflection===
Standard German verbs inflect into:
* one of two conjugation classes, [[weak verb|weak]] and [[strong verb|strong]] (like English).
(There is actually a third class, known as mixed verbs, which exhibit inflections combining features of both the strong and weak patterns.)
* three persons: 1st, 2nd, 3rd.
* two numbers: singular and plural
* three [[Grammatical mood|mood]]s: Indicative, Subjunctive, Imperative
* two [[Grammatical voice|genera verbi]]: active and passive; the passive is a compound form and is divided into static and dynamic passive.
* two non-composed tenses ([[present tense|present]], [[preterite]]) and four composed tenses ([[perfect tense|perfect]], [[pluperfect]], [[Future tense|future]] and [[Future perfect tense|future perfect]])
* distinction between [[grammatical aspect]]s is rendered by combined use of subjunctive and/or preterite marking: a verb with neither marking is plain indicative; subjunctive marking alone conveys second-hand information; subjunctive plus preterite marking forms the conditional; and preterite marking alone is either plain indicative (in the past) or serves as a (literal) alternative for second-hand information or for the conditional, when one of these would otherwise be indistinguishable.
* distinction between perfect and [[Continuous and progressive aspects|progressive aspect]] has been a productive category at every stage of the older language and exists in nearly all documented dialects, but it is rigorously excluded from written usage in the present standardised form of the language.
* disambiguation of completed vs. uncompleted forms is widely observed and regularly generated by common prefixes (blicken - to look, erblicken - to see [unrelated form: sehen - to see]).
====Verb prefixes====
There are also many ways to expand, and sometimes radically change, the meaning of a base verb through a relatively small number of prefixes. Some of those prefixes have a meaning themselves (Example: zer- refers to the destruction of things, as in zerreißen = to tear apart, zerbrechen = to break apart, zerschneiden = to cut apart), others do not have more than the vaguest meaning in and of themselves (Example: ver- , as in versuchen = to try, vernehmen = to interrogate, verteilen = to distribute, verstehen = to understand). More examples: haften = to stick, verhaften = to imprison; kaufen = to buy, verkaufen = to sell; hören = to hear, aufhören = to cease; fahren = to drive, erfahren = to get to know, to hear about something.
=====Separable prefixes=====
Many [[German verbs]] have a separable prefix, often with an adverbial function. In [[finite verb]] forms this prefix is split off and moved to the end of the clause, and is hence considered by some to be a "resultative particle". For example, ''mitgehen'', meaning "to go with", would be split to give ''Gehen Sie mit?'' (literal: "Go you with?"; idiomatic: "Are you going along?").
Indeed, several [[parenthetic]]al clauses may occur between the prefix of a finite verb and its complement; e.g.
:''Er '''kam''' am Freitagabend nach einem harten Arbeitstag und dem üblichen Ärger, der ihn schon seit Jahren immer wieder an seinem Arbeitsplatz plagt, mit fraglicher Freude auf ein Mahl, das seine Frau ihm, wie er hoffte, bereits aufgetischt hatte, endlich zu Hause '''an''' ''.
A literal translation of this example might look like this:
:He '''arr-''' on Friday evening, after a hard day at work and the usual annoyances that have been plaguing him at his workplace again and again for years, with questionable anticipation of a meal which, as he hoped, his wife had already served him, finally '''-ived''' at home.
===Word order===
German requires that a verbal element (main verb or [[auxiliary verb]]) appear second in the sentence, preceded by the most important topical phrase. The second most important phrase appears at the end of the sentence. For a sentence without an auxiliary, this gives several options:
: ''{{lang|de|Der alte Mann gibt mir das Buch heute.}}'' (The old man gives me the book today)
: ''{{lang|de|Der alte Mann gibt mir heute das Buch.}}''
: ''{{lang|de|Das Buch gibt mir der alte Mann heute.}}''
: ''{{lang|de|Das Buch gibt der alte Mann heute mir.}}'' ([[stress (linguistics)|stress]] on ''mir'')
: ''{{lang|de|Das Buch gibt heute der alte Mann mir.}}'' (as well)
: ''{{lang|de|Das Buch gibt der alte Mann mir heute.}}''
: ''{{lang|de|Das Buch gibt heute mir der alte Mann.}}''
: ''{{lang|de|Das Buch gibt mir heute der alte Mann.}}''
: ''{{lang|de|Heute gibt mir der alte Mann das Buch.}}''
: ''{{lang|de|Heute gibt mir das Buch der alte Mann.}}''
: ''{{lang|de|Heute gibt der alte Mann mir das Buch.}}''
: ''{{lang|de|Mir gibt der alte Mann das Buch heute.}}''
: ''{{lang|de|Mir gibt heute der alte Mann das Buch.}}''
: ''{{lang|de|Mir gibt der alte Mann heute das Buch.}}''
The position of a noun as a subject or object in a German sentence doesn't affect the meaning of the sentence as it would in English. In a [[Sentence (linguistics)|declarative sentence]] in English if the subject does not occur before the predicate the sentence could well be misunderstood.
For example, in the sentence "Man bites dog" it is clear who did what to whom. Exchanging the place of the subject with that of the object ("Dog bites man") changes the meaning completely. In other words, the word order in a sentence conveys significant information. In German, nouns and articles are declined as in Latin, which indicates whether a noun is the [[subject (linguistics)|subject]] or [[object (linguistics)|object]] of the verb's action. The above example in German would be ''{{lang|de|Ein Mann beißt den Hund}}'' or ''{{lang|de|Den Hund beißt ein Mann}}'', with both having exactly the same meaning. If the articles are omitted, which is sometimes done in headlines (''{{lang|de|Mann beißt Hund}}''), the syntax applies as in English: the first noun is the subject and the noun following the predicate is the object.
Except for emphasis, adverbs of time have to appear in the third place in the sentence, just after the predicate. Otherwise the speaker would be recognised as non-German. For instance the German word order (in Modern English) is: We're going tomorrow to town. (''{{lang|de|Wir gehen morgen in die Stadt.}}'')
====Auxiliary verbs====
When an [[auxiliary verb]] is present, the auxiliary appears in second position, and the main verb appears at the end. This occurs notably in the creation of the [[perfect tense]]. Many word orders are still possible, e.g.:
:''{{lang|de|Der alte Mann hat mir das Buch gestern gegeben.}}'' (The old man gave me the book yesterday.)
:''{{lang|de|Der alte Mann hat mir gestern das Buch gegeben.}}''
:''{{lang|de|Das Buch hat mir der alte Mann gestern gegeben.}}''
:''{{lang|de|Das Buch hat mir gestern der alte Mann gegeben.}}''
:''{{lang|de|Gestern hat mir der alte Mann das Buch gegeben.}}''
:''{{lang|de|Gestern hat mir das Buch der alte Mann gegeben.}}''
The word order is generally less rigid than in Modern English except for nouns (see below). There are two common [[word order]]s; one is for main [[clause]]s and another for [[subordinate clause]]s. In normal positive sentences the ''inflected'' verb always has position 2; in questions, exclamations and wishes it always has position 1. In subordinate clauses the verb is supposed to occur at the very end, but in speech this rule is often disregarded. For example in a [[Dependent clause|subordinate clause]] introduced by "weil" ("because") the verb quite often occupies the same order as in a [[Independent clause|main clause]]. The correct way of saying "because I'm broke" is ''"{{lang|de|…weil ich pleite bin.}}"''. In the vernacular you may hear instead ''"{{lang|de|…weil ich bin pleite.}}"'' This phenomenon may be caused by mixing the word-order pattern used for the word ''{{lang|de|weil}}'' with the pattern used for an alternative word for "because", ''{{lang|de|denn}}'', which is used with the main clause order (''"{{lang|de|…denn ich bin pleite.}}"'').
====Modal verbs====
Sentences using modal verbs place the infinitive at the end. For example, the sentence in Modern English "Should he go home?" would be rearranged in German to say "Should he (to) home go?" (''{{lang|de|Soll er nach Hause gehen?}}''). Thus in sentences with several subordinate or relative clauses the infinitives are clustered at the end. Compare the similar clustering of prepositions in the following English sentence: "What did you bring that book that I don't like to be read to out of up for?"
====Multiple infinitives====
The number of infinitives at the end is usually restricted to two, causing the third infinitive or auxiliary verb that would have gone at the very end to be placed instead at the beginning of the chain of verbs. For example, the sentence "Should he move into the house that he just has had renovated?" would be rearranged to "Should he into the house move, that he just renovated had?" (''{{lang|de|Soll er in das Haus einziehen, das er gerade hat renovieren lassen?}}''). The older form would have been ''{{lang|de|Soll er in das Haus, das er gerade hat renovieren lassen, einziehen?}}''.
If there are more than three infinitives, all except the first two are relocated to the beginning of the chain. Needless to say the rule is not rigorously applied.
==Vocabulary==
Most German vocabulary is derived from the Germanic branch of the Indo-European language family, although there are significant minorities of words derived from Latin and [[Greek language|Greek]], and a smaller number from French and, most recently, English. At the same time, German is highly effective at forming equivalents for foreign words from its inherited Germanic stem repertory. Thus, [[Notker Labeo]] was able to translate Aristotelian treatises into pure (Old High) German in the decades after the year 1000. Overall, German has fewer Romance-language loanwords than does English.
The coining of new, autochthonous words gave German a vocabulary of an estimated 40,000 words as early as the ninth century. In comparison, Latin, with a written tradition of nearly 2,500 years in an empire which ruled the Mediterranean, has grown to no more than 45,000 words today.
Even today, many low-key scholarly movements try to promote the ''[[Ersatz]]'' (substitution) of virtually all foreign words with ancient, dialectal, or [[neologism|neologous]] German alternatives. It is claimed that this would also help in spreading modern or scientific notions among the less educated, and thus democratise public life, too. Jurisprudence in Germany, for example, uses perhaps the "purest" tongue in terms of "Germanness", but also the most cumbersome, to be found today.
In the modern scientific German vocabulary data base in Leipzig (as of July 2003) there are nine million words and word groups in 35 million sentences (out of a corpus of 500 million words).
==Writing system==
=== Present ===
German is written using the Latin alphabet. In addition to the 26 standard letters, German has three vowels with [[Umlaut (diacritic)|Umlaut]], namely ''ä'', ''ö'' and ''ü'', as well as the Eszett or ''[[scharfes s]]'' (sharp s), ''[[ß]]''.
Before the German spelling reform of 1996, ''ß'' replaced ''ss'' after [[Vowel length|long vowels]] and diphthongs and before consonants, word-, or partial-word-endings. In reformed spelling, ''ß'' replaces ''ss'' only after long vowels and diphthongs. Since there is no [[capital ß]], it is always written as SS when capitalization is required. For example, ''Maßband'' (tape measure) is capitalized ''MASSBAND''. An exception is the use of ß in legal documents and forms when capitalizing names. To avoid confusion with similar names, a "ß" is to be used instead of "SS". (So: "KREßLEIN" instead of "KRESSLEIN".) A capital ß has been proposed and included in [[Unicode]], but it is not yet recognized as standard German. In [[Switzerland]], ß is not used at all.
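The capitalization behaviour described above can be observed in software. As a minimal sketch, Python's built-in <code>str.upper()</code> follows the Unicode case mapping and expands ''ß'' to ''SS''; the helper <code>upper_keep_eszett</code> is a hypothetical function written for this example to mimic the legal-document exception, not part of any standard library.

```python
def upper_keep_eszett(name: str) -> str:
    """Capitalize a name but keep 'ß' as is (hypothetical helper that
    mimics the exception described for legal documents and forms)."""
    return "".join(ch if ch == "ß" else ch.upper() for ch in name)


# Default Unicode upper-casing expands ß to SS.
print("Maßband".upper())
# The exception keeps ß so that similar names stay distinguishable.
print(upper_keep_eszett("Kreßlein"))
print(upper_keep_eszett("Kresslein"))
```

Note that the default mapping is lossy: after upper-casing, ''Maßband'' can no longer be told apart from a hypothetical spelling with ''ss'', which is exactly the ambiguity the exception for names is meant to avoid.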
Umlaut vowels (ä, ö, ü) are commonly transcribed as ae, oe, and ue if the umlauts are not available on the keyboard used. In the same manner, ß can be transcribed as ss. German readers understand these transcriptions (although they look unusual), but they are avoided if the regular umlauts are available because they are considered a makeshift, not proper spelling. (In Westphalia, city and family names exist where the extra e has a vowel lengthening effect, e.g. ''Raesfeld'' [ˈraːsfɛlt] and ''Coesfeld'' [ˈkoːsfɛlt], but this use of the letter e after a/o/u does not occur in the present-day spelling of words other than [[proper noun]]s.)
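The replacement of umlauts and ß described above amounts to a simple character mapping. The following Python sketch uses the standard <code>str.maketrans</code> and <code>str.translate</code> methods; the names <code>TRANSLITERATION</code> and <code>transliterate</code> are assumptions made for this example.

```python
# Makeshift spelling for keyboards without umlauts or ß.
TRANSLITERATION = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
})


def transliterate(text: str) -> str:
    """Replace umlauts and ß with their ASCII makeshift spellings."""
    return text.translate(TRANSLITERATION)


print(transliterate("Ärzte für Maßband"))
```

The mapping cannot be reliably reversed: in a name such as ''Raesfeld'' the sequence ''ae'' is not a makeshift for ''ä'', so converting ''ae'' back to ''ä'' mechanically would corrupt it.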
Unfortunately there is still no general agreement exactly where these umlauts occur in the sorting sequence. Telephone directories treat them by replacing them with the base vowel followed by an e, whereas dictionaries use just the base vowel. As an example in a [[Telephone directory|telephone book]] ''Ärzte'' occurs after ''Adressenverlage'' but before ''Anlagenbauer'' (because Ä is replaced by Ae). In a dictionary ''Ärzte'' occurs after ''Arzt'' but before ''Asbest'' (because Ä is treated as A). In some older dictionaries or indexes, initial ''Sch'' and ''St'' are treated as separate letters and are listed as separate entries after ''S''.
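The two sorting conventions can be expressed as two different sort keys. The sketch below is a simplified illustration in Python, with the function names <code>phonebook_key</code> and <code>dictionary_key</code> chosen for this example; it handles only the three umlauts and ignores the older treatment of ''Sch'' and ''St''.

```python
# Telephone-directory convention: umlaut = base vowel followed by e.
PHONEBOOK = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue"})
# Dictionary convention: umlaut = base vowel.
DICTIONARY = str.maketrans({"ä": "a", "ö": "o", "ü": "u"})


def phonebook_key(word: str) -> str:
    return word.casefold().translate(PHONEBOOK)


def dictionary_key(word: str) -> str:
    return word.casefold().translate(DICTIONARY)


print(sorted(["Anlagenbauer", "Ärzte", "Adressenverlage"], key=phonebook_key))
print(sorted(["Asbest", "Ärzte", "Arzt"], key=dictionary_key))
```

Production software would normally rely on a locale-aware collation library rather than hand-written keys; the sketch only makes the difference between the two conventions concrete.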
=== Past ===
Until the early 20th century, German was mostly printed in [[blackletter]] [[typefaces]] (mostly in [[fraktur (typeface)|Fraktur]], but also in [[Schwabacher]]) and written in corresponding [[Penmanship|handwriting]] (for example [[Kurrent]] and [[Sütterlin]]). These variants of the Latin alphabet are very different from the serif or [[Sans-serif|sans serif]] [[Antiqua]] typefaces used today, and particularly the handwritten forms are difficult for the untrained to read. The printed forms however were claimed by some to be actually more readable when used for printing [[Germanic language]]s. The [[Nazis]] initially promoted Fraktur and Schwabacher since they were considered [[Aryan]], although they later abolished them in 1941 by claiming that these letters were Jewish. The latter fact is not widely known anymore; today the letters are often associated with the Nazis and are no longer commonly used. The Fraktur script remains present in everyday life through road signs, pub signs, beer brands and other forms of advertisement, where it is used to convey a certain rusticality and oldness.
Proper use of the [[long s]] (''langes s''), [[Long s|ſ]], is essential for writing German text in [[Fraktur (script)|Fraktur]] typefaces. Many [[Antiqua script|Antiqua]] typefaces also include the long s. A specific set of rules applies to the use of long s in German text, but it is now rarely used in Antiqua typesetting. Any lower case "s" at the beginning of a syllable would be a long s, as opposed to a terminal s or short s (the more common variation of the letter s), which marks the end of a syllable; for example, in differentiating between the words ''Wachſtube'' (=guard-house) and ''Wachstube'' (=tube of floor polish). Which "s" to use can easily be decided by appropriate hyphenation ("Wach-ſtube" vs. "Wachs-tube"). The long s only appears in [[lower case]].
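The hyphenation test can be sketched in a few lines of Python. The function <code>apply_long_s</code> is a hypothetical helper written for this example; it implements only the simplified rule stated above (a lower case s at the start of a syllable becomes ſ) and assumes the caller supplies the word already hyphenated, since finding syllable boundaries automatically is a separate problem.

```python
def apply_long_s(hyphenated: str) -> str:
    """Turn a syllable-initial lower case 's' into long s ('ſ').

    The input must already be split into syllables with hyphens,
    e.g. 'Wach-stube'. Only the simplified rule is applied; the full
    Fraktur rules have further cases not covered here.
    """
    syllables = hyphenated.split("-")
    converted = ["ſ" + syl[1:] if syl.startswith("s") else syl for syl in syllables]
    return "".join(converted)


print(apply_long_s("Wach-stube"))  # guard-house
print(apply_long_s("Wachs-tube"))  # tube of floor polish
```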
The widespread ignorance of the correct use of the Fraktur scripts shows, however, in the many mistakes made, such as the frequent erroneous use of the round s instead of the [[long s]] at the beginning of a syllable, the failure to employ the mandatory [[Typographical ligature|ligature]]s of Fraktur, or the use of letter-forms closer to Antiqua for certain especially hard-to-read Fraktur letters.
==Phonology==
===Vowels===
German vowels (excluding diphthongs; see below) come in ''short'' and ''long'' varieties, as detailed in the following table:
Short {{IPA|/ɛ/}} is realised as {{IPA|[ɛ]}} in stressed syllables (including [[secondary stress]]), but as {{IPA|[ǝ]}} in unstressed syllables. Note that stressed short {{IPA|/ɛ/}} can be spelled either with ''e'' or with ''ä'' (''hätte'' 'would have' and ''Kette'' 'chain', for instance, rhyme). In general, the short vowels are open and the long vowels are closed. The one exception is the open {{IPA|/ɛː/}} sound of long Ä; in some varieties of standard German, {{IPA|/ɛː/}} and {{IPA|/eː/}} have merged into {{IPA|[eː]}}, removing this anomaly. In that case, pairs like ''Bären/Beeren'' 'bears/berries' or ''Ähre/Ehre'' 'spike/honour' become homophonous.
In many varieties of standard German, an unstressed {{IPA|/ɛr/}} is not pronounced as {{IPA|[ər]}}, but vocalised to {{IPA|[ɐ]}}.
Whether any particular vowel letter represents the long or short phoneme is not completely predictable, although the following regularities exist:
* If a vowel (other than ''i'') is at the end of a syllable or followed by a single consonant, it is usually pronounced long (e.g. ''Hof'' [hoːf]).
* If the vowel is followed by a double consonant (e.g. ''ff'', ''ss'' or ''tt''), ''ck'', ''tz'' or a [[consonant cluster]] (e.g. ''st'' or ''nd''), it is nearly always short (e.g. ''hoffen'' [ˈhɔfǝn]). Double consonants are used only for this function of marking preceding vowels as short; the consonant itself is never pronounced lengthened or doubled.
Both of these rules have exceptions (e.g. ''hat'' [hat] 'has' is short despite the first rule; ''Kloster'' {{IPA|[kloːstər]}}, '[[cloister]]'; ''Mond'' {{IPA|[moːnt]}}, '[[moon]]' are long despite the second rule). For an ''i'' that is neither in the combination ''ie'' (making it long) nor followed by a double consonant or cluster (making it short), there is no general rule. In some cases, there are regional differences: In central Germany (Hessen), the ''o'' in the [[Noun#Proper nouns and common nouns|proper name]] "Hoffmann" is pronounced long while most other Germans would pronounce it short; the same applies to the ''e'' in the geographical name "Mecklenburg" for people in that region. The word ''Städte'' 'cities', is pronounced with a short vowel {{IPA|[ˈʃtɛtə]}} by some (Jan Hofer, ARD Television) and with a long vowel {{IPA|[ˈʃtɛːtə]}} by others (Marietta Slomka, ZDF Television). Finally, a vowel followed by ''ch'' can be short (''Fach'' {{IPA|[fax]}} 'compartment', ''Küche'' {{IPA|[ˈkʏçə]}} 'kitchen') or long (''Suche'' {{IPA|[ˈzuːxə]}} 'search', ''Bücher'' {{IPA|[ˈbyːçər]}} 'books') almost at random. Thus, ''Lache'' is homographous: {{IPA|[ˈlaːxə]}} 'puddle' and {{IPA|[ˈlaxə]}} 'manner of laughing' (coll.), 'laugh!' (Imp.).
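The two spelling regularities can be written down as a rough heuristic. The Python sketch below is an illustration only: the function <code>guess_vowel_length</code> and its return values are assumptions made for this example, it looks only at the first vowel letter of a word, counts letters rather than sounds (so ''ch'' counts as two consonants), and ignores diphthongs. As the exceptions above show, its guesses are frequently wrong.

```python
VOWELS = set("aeiouäöü")


def guess_vowel_length(word: str) -> str:
    """Guess 'long', 'short' or 'unknown' for the first vowel of a word,
    using only the two spelling regularities (a heuristic, not a rule)."""
    w = word.lower()
    start = next((i for i, ch in enumerate(w) if ch in VOWELS), None)
    if start is None:
        return "unknown"
    # Count the consonant letters between the first vowel and the next vowel.
    following = 0
    for ch in w[start + 1:]:
        if ch in VOWELS:
            break
        following += 1
    if following >= 2:
        return "short"    # double consonant, ck, tz or a cluster
    if w[start] == "i":
        return "unknown"  # no general rule for i
    return "long"         # end of syllable or a single consonant


for example in ["Hof", "hoffen", "hat", "Mond"]:
    print(example, guess_vowel_length(example))
```

The heuristic classifies ''Hof'' and ''hoffen'' as the rules predict, but it mislabels the exceptions ''hat'' (actually short) and ''Mond'' (actually long), which is why vowel length ultimately has to be learned word by word.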
German vowels can form the following digraphs (in writing) and diphthongs (in pronunciation); note that the pronunciation of some of them (ei, äu, eu) is very different from what one would expect when considering the component letters:
Additionally, the digraph ''ie'' generally represents the phoneme {{IPA|/iː/}}, which is not a diphthong. In many varieties, a /r/ at the end of a syllable is vocalised. However, a sequence of a vowel followed by such a vocalised /r/ is not considered a diphthong: Bär {{IPA|[bɛːɐ̯]}} 'bear', er {{IPA|[eːɐ̯]}} 'he', wir {{IPA|[viːɐ̯]}} 'we', Tor {{IPA|[toːɐ̯]}} 'gate', kurz {{IPA|[kʊɐ̯ts]}} 'short', Wörter {{IPA|[vœɐ̯tɐ]}} 'words'.
In most varieties of standard German, word stems that begin with a vowel are preceded by a [[glottal stop]] [ʔ].
===Consonants===
* '''c''' standing by itself is not a German letter. In borrowed words, it is usually pronounced [ʦ] (before ä, äu, e, i, ö, ü, y) or [k] (before a, o, u, or before consonants). The combination '''ck''' is, as in English, used to indicate that the preceding vowel is short.
* '''ch''' occurs most often and is pronounced either [ç] (after ä, ai, äu, e, ei, eu, i, ö, ü and after consonants) or [x] (after a, au, o, u). Ch never occurs at the beginning of an originally German word. In borrowed words with initial Ch there is no single agreement on the pronunciation. For example, the word ''"Chemie"'' (chemistry) can be pronounced [keːˈmiː], [çeːˈmiː] or [ʃeːˈmiː] depending on dialect.
* '''dsch''' is pronounced [ʤ] (like ''j'' in ''jungle'') but appears in a few [[loanwords]] only.
* '''f''' is pronounced [f] as in "''f''ather".
* '''h''' is pronounced [h] like in "''h''ome" at the beginning of a syllable. After a vowel it is silent and only lengthens the vowel (e.g. ''"Reh"'' = [[roe deer]]).
* '''j''' is pronounced [j] in Germanic words (''"Jahr"'' [jaːɐ]). In younger loanwords, it follows more or less the respective languages' pronunciations.
* '''l''' is always pronounced [l], never [ɫ] (the English "[[Dark L]]").
* '''q''' only exists in combination with '''u''' and appears both in Germanic and Latin words (''"quer"''; ''"Qualität"''). It is pronounced [kv].
* '''r''' is pronounced as a [[Guttural R|guttural sound]] (a [[uvular trill]], [ʀ]) in front of a vowel or consonant (''"Rasen"'' [ʀaːzən]; ''"Burg"'' like [buʀg]). In spoken German however, it is commonly vocalised after a vowel (''"er"'' being pronounced rather like ['ɛɐ] - ''"Burg"'' [buɐg]). In some southern non-standard varieties, the '''r''' is pronounced as a tongue-tip r (the [[alveolar trill]]).
* '''s''' in Germany, is pronounced [z] (as in "''Z''ebra") if it forms the [[syllable onset]] (e.g. Sohn [zoːn]), otherwise [s] (e.g. Bus [bʊs]). In Austria, always pronounced [s]. A '''ss''' [s] indicates that the preceding vowel is short. '''st''' and '''sp''' at the beginning of words of German origin are pronounced [ʃt] and [ʃp], respectively.
* '''ß''' (a letter unique to German called "Eszett") was a ligature of a double '''s''' ''and'' of a '''sz''' and is always pronounced [s]. Originating in [[Blackletter]] typeface, it traditionally replaced '''ss''' at the end of a syllable (e.g. ''"ich muss"'' → ''"ich muß"''; ''"ich müsste"'' → ''"ich müßte"''); within a word it contrasts with '''ss''' [s] in indicating that the preceding vowel is long (compare ''"in Maßen"'' [in 'maːsən] "with moderation" and ''"in Massen"'' [in 'masən] "in loads"). The use of '''ß''' has recently been limited by the latest German spelling reform, and it is no longer used for '''ss''' at the end of a syllable; Switzerland and Liechtenstein already abolished it in 1934.
* '''sch''' is pronounced [ʃ] (like "sh" in "Shine").
* '''v''' is pronounced [f] in words of Germanic origin (e.g. ''"Vater"'' [ˈfaːtɐ]) and [v] in most other words (e.g. ''"Vase"'' [ˈvaːzǝ]).
* '''w''' is pronounced [v] like in "''v''acation" (e.g. ''"was"'' [vas]).
* '''y''' only appears in loanwords and is traditionally considered a vowel.
* '''z''' is always pronounced [ʦ] (e.g. ''"zog"'' [ʦoːk]). A '''tz''' indicates that the preceding vowel is short.
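The distribution of the two pronunciations of '''ch''' given in the list above depends only on the preceding letters, so it can be sketched as a small function. The Python code below is a simplified illustration: the name <code>ch_allophone</code> is an assumption made for this example, only the first ''ch'' in a word is examined, and the trigraph ''sch'' and word-initial ''Ch'' in borrowed words are not handled.

```python
def ch_allophone(word: str) -> str:
    """Return 'x' or 'ç' for the first 'ch' in a word, based on what precedes it.

    Simplified sketch: [x] after a, au, o, u; [ç] after front vowels,
    the diphthongs äu/eu/ai/ei and consonants. Not valid for 'sch'.
    """
    w = word.lower()
    i = w.find("ch")
    if i <= 0:
        raise ValueError("word must contain a non-initial 'ch'")
    before = w[:i]
    if before.endswith(("äu", "eu")):
        return "ç"  # these diphthongs end in the letter u but take [ç]
    if before[-1] in "aou":
        return "x"
    return "ç"


for example in ["Fach", "Küche", "Suche", "Bücher", "euch", "Milch"]:
    print(example, ch_allophone(example))
```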
====Consonant shifts====
German does not have any [[dental fricative]]s (as English '''th'''). The '''th''' sounds, which the English language has inherited from [[Anglo-Saxons|Anglo Saxon]], survived on the continent up to Old High German and then disappeared in German with the consonant shifts between the 8th and the 10th century. It is sometimes possible to find parallels between English and German by replacing the English '''th''' with '''d''' in German: "Thank" → in German "Dank", "this" and "that" → "dies" and "das", "[[thou]]" (old 2nd person singular pronoun) → "du", "think" → "denken", "thirsty" → "durstig" and many other examples.
Likewise, the '''gh''' in [[Germanic languages|Germanic]] English words, pronounced in several different ways in modern English (as an '''f''', or not at all), can often be linked to German '''ch''': "to laugh" → "lachen", "through" and "thorough" → "durch", "high" → "hoch", "naught" → "nichts", etc.
==Cognates with English==
There are many thousands of German words that are [[cognate]] to English words (in fact a sizeable fraction of native German and English vocabulary, although for various reasons much of it is not immediately obvious). Most of the words in the following table have almost the same meaning as in English.
Compound word cognates
When these cognates have slightly different consonants, this is often due to the High German consonant shift.
Hence the affinity of English words with those of German dialects is more evident:
There are cognates whose meanings in either language have changed through the centuries. It is sometimes difficult for both English and German speakers to discern the relationship. On the other hand, once the definitions are made clear, then the logical relation becomes obvious. Sometimes the generality or specificity of word pairs may be opposite in the two languages.
German and English also share many borrowings from other languages, especially Latin, French and Greek. Most of these words have the same meaning, while a few have subtle differences in meaning. As many of these words have been borrowed by numerous languages, not only German and English, they are called ''[[internationalism (linguistics)|internationalisms]]'' in German linguistics. For reference, a good number of these borrowed words are of the neuter gender.
==Words borrowed by English==
:''For a list of German loanwords in English, see [[:Category:German loanwords]]''
In the English language, there are also many words taken from German without any letter change, e.g.:
==Names for German in other languages==
:''See also: [[Deutsch]], [[Names for the Dutch language|Dutch]], [[Deitsch]], [[Dietsch]], [[Teuton]], [[Teutonic]], [[Allemanic]], [[Alleman]], [[Theodisca]]''
The names that countries have for the language differ from region to region.
In Italian the sole name for German is still ''tedesco'', from the Latin ''[[theodiscus]]'', meaning "vernacular".
A possible explanation for the use of words meaning "mute" (e.g., ''nemoj'' in Russian, ''němý'' in Czech, ''nem'' in [[Serbian language|Serbian]]) to refer to German (and also to Germans) in Slavic languages is that Germans were the first people [[Slavic peoples|Slavic tribes]] encountered with whom they could not communicate. [[Romanian language|Romanian]] used to use the Slavonic term "nemţeşte", but "germană" is now widely used. Hungarian "német" is also of Slavonic origin. The [[Arabic language|Arabic]] name for Austria, النمسا ("an-namsa"), is derived from the Slavonic term.
Note also that though the Russian term for the language is ''немецкий'' ''(nemetskij)'', the country is ''Германия'' ''(Germania)''. However, in certain other [[Slavic languages]], such as Czech, the country name (''Německo'') is similar to the name of the language, ''německý'' (jazyk). [[Finns]] and [[Estonians]] use the term ''saksa'', originally from the [[Saxon people|Saxon]] tribe. [[Scandinavians]] use derivatives of the word ''Tyskland/Þýskaland'' (from Theodisca) for the country and ''tysk(a)/þýska'' for the language. [[Hebrew language|Hebrew]] traditionally (nowadays this is not the case) used the Biblical term אַשְׁכֲּנָז ([[Ashkenaz]]) (Genesis 10:3) to refer to Germany, or to certain parts of it, and the [[Ashkenazi]] Jews are those who originate from Germany and [[Eastern Europe]] and formerly spoke Yiddish as their native language, derived from [[Middle High German]]. Modern Hebrew uses גֶּרְמָנִי ''germaní'' (Or גֶּרְמָנִית ''germanít'' for the language).
The French term is ''allemand'', the Spanish term is ''alemán'', the [[Catalan language|Catalan]] term is ''alemany'', and the [[Portuguese language|Portuguese]] term is ''alemão''; all derive from the ancient [[Alamanni]] tribal alliance, meaning literally "''All Men''".
The [[Latvian language|Latvian]] term ''vācu'' means "tinny" and refers disparagingly to the iron-clad [[Teutonic Knights]] that colonized the Baltic in the Middle Ages.
The [[Scottish Gaelic]] term for the German language, ''Gearmailtis'', is formed in the standard way of adding ''-(a)is'' to the end of the country name.
See [[Names for Germany]] for further details on the origins of these and other terms.
GNU General Public License
The '''GNU General Public License''' ('''GNU GPL''' or simply '''GPL''') is a widely used [[free software license]], originally written by [[Richard Stallman]] for the [[GNU project]]. The GPL is the most popular and well-known example of the type of strong [[copyleft]] license that requires derived works to be available under the same copyleft. Under this philosophy, the GPL is said to grant the recipients of a [[computer program]] the rights of the [[free software definition]] and uses copyleft to ensure the freedoms are preserved, even when the work is changed or added to. This is in distinction to [[permissive free software licenses]], of which the [[BSD licenses]] are the standard examples.
The [[GNU Lesser General Public License]] (LGPL) is a modified, more permissive, version of the GPL, originally intended for some [[library (computing)|software libraries]]. There is also a [[GNU Free Documentation License]], which was originally intended for use with documentation for GNU software, but has also been adopted for other uses, such as the [[Wikipedia]] project.
The [[Affero General Public License]] (GNU AGPL) is a similar license with a focus on networking server software. The GNU AGPL is similar to the GNU General Public License, except that it additionally covers the use of the software over a computer network, requiring that the complete source code be made available to any network user of the AGPLed work, for example a web application. The Free Software Foundation recommends that this license be considered for any software that will commonly be run over the network.
==History==
The GPL was written by [[Richard Stallman]] in 1989 for use with programs released as part of the [[GNU project]]. The original GPL was based on a unification of similar licenses used for early versions of [[GNU Emacs]], the [[GNU Debugger]] and the [[GNU Compiler Collection]]. These licenses contained similar provisions to the modern GPL, but were specific to each program, rendering them incompatible, despite being the same license. Stallman's goal was to produce one license that could be used for any project, thus making it possible for many projects to share code.
An important vote of confidence in the GPL came from [[Linus Torvalds]]' adoption of the license for the [[History of the Linux kernel|Linux kernel]] in 1992, switching from an earlier license that prohibited commercial distribution.
As of August 2007, the GPL accounted for nearly 65% of the 43,442 free software projects listed on [[Freshmeat]], and [[As of 2006|as of January 2006]], about 68% of the projects listed on [[SourceForge.net]]. Similarly, a 2001 survey of [[Red Hat Linux]] 7.1 found that 50% of the source code was licensed under the GPL and a 1997 survey of [[Ibiblio|MetaLab]], then the largest free software archive, showed that the GPL accounted for about half of the licenses used. One survey of a large repository of open-source software reported that in July 1997, about half the software packages with explicit license terms used the GPL. Prominent free software programs licensed under the GPL include the [[Linux kernel]] and the [[GNU Compiler Collection]] (GCC). Some other free software programs are [[dual-licensed]] under multiple licenses, often with one of the licenses being the GPL.
Some observers believe that the strong [[copyleft]] provided by the GPL was crucial to the success of Linux, giving the programmers who contributed to it the confidence that their work would benefit the whole world and remain free, rather than being exploited by software companies that would not have to give anything back to the community.
The second version of the license, version 2, was released in 1991. Over the following 15 years, some members of the [[free software community|FOSS (Free and Open Source Software) community]] came to believe that some software and hardware vendors were finding loopholes in the GPL, allowing GPL-licensed software to be exploited in ways that were contrary to the intentions of the programmers. These concerns included [[tivoization]] (the inclusion of GPL-licensed software in hardware that will refuse to run modified versions of its software); the use of unpublished, modified versions of GPL software behind web interfaces; and patent deals between [[Microsoft]] and Linux and Unix distributors that may represent an attempt to use patents as a weapon against competition from Linux.
Version 3 was developed to address these concerns. It was [http://www.fsf.org/news/gplv3_launched officially released] on [[June 29]], [[2007]].
==Versions==
===Version 1===
Version 1 of the GNU GPL, released in January 1989, prevented what were then the two main ways that software distributors restricted the freedoms that define free software. The first problem was that distributors might publish [[binary file]]s only: executable, but not readable or modifiable by humans. To prevent this, GPLv1 said that any vendor distributing binaries must also make the human-readable source code available under the same licensing terms.
The second problem was that distributors might add restrictions, either by adding them to the license or by combining the software with other software that had other restrictions on its distribution. If this was done, the union of the two sets of restrictions would apply to the combined work, so unacceptable restrictions could be added. To prevent this, GPLv1 said that modified versions, as a whole, had to be distributed under the terms in GPLv1. Therefore, software distributed under the terms of GPLv1 could be combined with software under more permissive terms, as this would not change the terms under which the whole could be distributed, but software distributed under GPLv1 could not be combined with software distributed under a more restrictive license, as this would conflict with the requirement that the whole be distributable under the terms of GPLv1.
===Version 2===
According to Richard Stallman, the major change in GPLv2 was the "Liberty or Death" clause, as he calls it - Section 7. This section says that if someone has restrictions imposed that ''prevent'' him or her from distributing GPL-covered software in a way that respects other users' freedom (for example, if a legal ruling states that he or she can only distribute the software in binary form), he or she cannot distribute it at all.
By 1990, it was becoming apparent that a less restrictive license would be strategically useful for some software libraries. When version 2 of the GPL (GPLv2) was released in June 1991, therefore, a second license, the Library General Public License (LGPL), was introduced at the same time and numbered with version 2 to show that both were complementary. The version numbers diverged in 1999 when version 2.1 of the LGPL was released, which renamed it the [[GNU Lesser General Public License]] to reflect its place in the GNU philosophy.
===Version 3===
In late 2005, the [[Free Software Foundation]] (FSF) announced work on version 3 of the GPL (GPLv3). On [[January 16]], [[2006]], the first "discussion draft" of GPLv3 was published, and the public consultation began. The public consultation was originally planned for nine to fifteen months but finally stretched to eighteen months with four drafts being published. The official GPLv3 was released by FSF on [[June 29]], [[2007]]. GPLv3 was written by [[Richard Stallman]], with legal counsel from [[Eben Moglen]] and [[Software Freedom Law Center]].
According to Stallman, the most important changes are in relation to [[Software patents and free software|software patents]], [[free software license]] compatibility, the definition of "source code", and hardware restrictions on software modification ("[[tivoization]]"). Other changes relate to internationalisation, how license violations are handled, and how additional permissions can be granted by the copyright holder.
Other notable changes include allowing authors to add certain additional conditions or requirements to their contributions. One of those new optional requirements, sometimes referred to as the Affero clause, is intended to fulfill a request regarding [[software as a service]]; permitting the addition of this requirement makes GPLv3 compatible with the [[Affero General Public License]].
The public consultation process was coordinated by the Free Software Foundation with assistance from [[Software Freedom Law Center]], [[Free Software Foundation Europe]], and other free software groups. Comments were collected from the public via the gplv3.fsf.org web portal. That portal runs purpose-written software called [[stet (software)|stet]]. These comments were passed to four committees comprising approximately 130 people, including supporters and detractors of FSF's goals. Those committees researched the comments submitted by the public and passed their summaries to Stallman for a decision on what the license would do.
During the public consultation process, 962 comments were submitted for the first draft.
By the end, a total of 2,636 comments had been submitted.
The third draft was released on [[March 28]], [[2007]]. This draft included language intended to prevent patent cross-licenses like the controversial [[Novell#Agreement with Microsoft|Microsoft-Novell patent agreement]] and restricted the anti-tivoization clauses to a legal definition of a "User" or "consumer product." It also explicitly removed the section on "Geographical Limitations", whose probable removal had been announced at the launch of the public consultation.
The fourth discussion draft, which was the last, was released on [[May 31]], [[2007]]. It introduced [[Apache Software License]] compatibility, clarified the role of outside contractors, and, in section 11 paragraph 6, made an exception to permit the Microsoft-Novell agreement.
That provision aims to make future such deals ineffective. The license is also meant to cause Microsoft to extend the patent licenses it grants to Novell customers for the use of GPLv3 software to ''all'' users of that GPLv3 software; this is possible only if Microsoft is legally a "conveyor" of the GPLv3 software.
Some high-profile developers of the [[Linux kernel]] commented to the mass media and made public statements about their objections to parts of discussion drafts 1 and 2.
== Terms and conditions ==
The terms and conditions of the GPL are available to anybody receiving a copy of the work that has a GPL applied to it ("the licensee"). Any licensee who adheres to the terms and conditions is given permission to modify the work, as well as to copy and redistribute the work or any derivative version. The licensee is allowed to charge a fee for this service, or do this free of charge. This latter point distinguishes the GPL from software licenses that prohibit commercial redistribution. The FSF argues that free software should not place restrictions on commercial use, and the GPL explicitly states that GPL works may be sold at any price.
The GPL additionally states that a distributor may not impose "further restrictions on the rights granted by the GPL". This forbids activities such as distributing the software under a non-disclosure agreement or contract. Distributors under the GPL also grant a license for any of their patents practiced by the software, to practice those patents in GPL software.
Section three of the license requires that programs distributed as pre-compiled binaries be accompanied by one of the following: a copy of the source code; a written offer to distribute the source code via the same mechanism as the pre-compiled binary; or the written offer to obtain the source code that the distributor received along with the pre-compiled binary under the GPL.
=== Copyleft ===
The distribution rights granted by the GPL for modified versions of the work are not unconditional. When someone distributes a GPL'd work plus their own modifications, the requirements for distributing the whole work cannot be any greater than the requirements that are in the GPL.
This requirement is known as copyleft. It derives its legal power from the use of [[copyright]] on software programs. Because a GPL work is copyrighted, a licensee has no right to redistribute it, not even in modified form (barring [[fair use]]), except under the terms of the license. One is only required to adhere to the terms of the GPL if one wishes to exercise rights normally restricted by copyright law, such as redistribution. Conversely, anyone who distributes copies of the work without abiding by the terms of the GPL (for instance, by keeping the source code secret) can be [[lawsuit|sued]] by the original author under copyright law.
Copyleft thus uses copyright law to accomplish the opposite of its usual purpose: instead of imposing restrictions, it grants rights to other people, in a way that ensures the rights cannot subsequently be taken away. It also ensures that unlimited redistribution rights are not granted, should any legal flaw (or "[[computer bug|bug]]") be found in the copyleft statement.
Many distributors of GPL'ed programs bundle the source code with the [[executable]]s. An alternative method of satisfying the copyleft is to provide a written offer to provide the source code on a physical medium (such as a CD) upon request. In practice, many GPL'ed programs are distributed over the [[Internet]], and the source code is made available over [[File Transfer Protocol|FTP]]. For Internet distribution, this complies with the license.
Copyleft applies only when a person seeks to redistribute the program. One is allowed to make private modified versions, without any obligation to divulge the modifications as long as the modified software is not distributed to anyone else. Note that the copyleft applies only to the software and not to its output (unless that output is itself a derivative work of the program); for example, a public web portal running a modified derivative of a GPL'ed [[content management system]] is not required to distribute its changes to the underlying software.
==Licensing and contractual issues==
The GPL was designed as a [[license]], rather than a [[contract]]. In some [[Common Law]] jurisdictions, the legal distinction between a license and a contract is an important one: contracts are enforceable by [[contract law]], whereas licenses are enforced under [[copyright law]]. However, this distinction is not useful in the many jurisdictions where there are no differences between contracts and licenses, such as [[Civil law (legal system)|Civil Law]] systems.
Those who do not agree to the GPL's terms and conditions do not have permission, under copyright law, to copy or distribute GPL-licensed software or derivative works. They may nevertheless use the software as they like.
== Copyright holders ==
The text of the GPL is itself copyrighted, and the copyright is held by the [[Free Software Foundation]] (FSF). However, the FSF does not hold the copyright for a work released under the GPL, unless an author explicitly assigns copyrights to the FSF (which seldom happens except for programs that are part of the [[GNU]] project). Only the individual copyright holders have the authority to sue when a license violation takes place.
The FSF permits people to create new licenses based on the GPL, as long as the derived licenses do not use the GPL preamble without permission. This is discouraged, however, since such a license is generally incompatible with the GPL. (See the [http://www.fsf.org/licenses/gpl-faq.html#ModifyGPL GPL FAQ] for more information.)
Other licenses created by the GNU project include the [[GNU Lesser General Public License]] and the [[GNU Free Documentation License]].
== The GPL in court ==
A key dispute related to the GPL is whether or not non-GPL software can [[library linking|dynamically link]] to GPL libraries. The GPL is clear in requiring that all [[derivative work]]s of GPL'ed code must themselves be GPL'ed. However, it is not clear whether an executable that dynamically links to GPL code should be considered a derivative work. The free/open-source software community is split on this issue. The FSF asserts that such an executable is indeed a derivative work if the executable and GPL code "make function calls to each other and share data structures," with others agreeing, while some (e.g. [[Linus Torvalds]]) agree that dynamic linking can create derived works but disagree over the circumstances. On the other hand, some experts have argued that the question is still open: one [[Novell]] lawyer has written that dynamic linking not being derivative "makes sense" but is not "clear-cut," and [[Lawrence Rosen]] has claimed that a court of law would "probably" exclude dynamic linking from derivative works although "there are also good arguments" on the other side and "the outcome is not clear" (on a later occasion, he argued that "market-based" factors are more important than the linking technique). This is ultimately a question not of the GPL ''per se'', but of how copyright law defines derivative works. In ''[[Galoob v. Nintendo]]'' the [[Ninth Circuit Court of Appeals]] defined a derivative work as having "'form' or permanence" and noted that "the infringing work must incorporate a portion of the copyrighted work in some form," but there have been no clear court decisions to resolve this particular conflict.
Since there is no record of anyone circumventing the GPL by dynamic linking and contesting when threatened with lawsuits by the copyright holder, the restriction appears ''[[de facto]]'' enforceable even if not yet proven ''[[de jure]]''.
In 2002, MySQL AB sued Progress NuSphere for copyright and trademark infringement in [[U.S. District Court for the District of Massachusetts|United States district court]]. NuSphere had allegedly violated MySQL's copyright by linking code for the Gemini table type into the MySQL server. After a preliminary hearing before Judge [[Patti Saris]] on [[February 27]], [[2002]], the parties entered settlement talks and eventually settled. At the hearing, Judge Saris "saw no reason" that the GPL would not be enforceable.
In August 2003, the [[SCO Group]] stated that they believed the GPL to have no legal validity, and that they intended to take up lawsuits over sections of code supposedly copied from SCO Unix into the [[Linux kernel]]. This was a problematic stand for them, as they had distributed Linux and other GPL'ed code in their [[Caldera OpenLinux]] distribution, and there is little evidence that they had any legal right to do so except under the terms of the GPL. For more information, see [[SCO-Linux controversies]] and [[SCO v. IBM]].
In April 2004 the [[netfilter/iptables]] project was granted a preliminary [[injunction]] against Sitecom Germany by [[Munich]] District Court after Sitecom refused to desist from distributing Netfilter's GPL'ed software in violation of the terms of the GPL. In July 2004, the German court confirmed this injunction as a final ruling against Sitecom. The court's justification for its decision exactly mirrored the predictions given earlier by the FSF's [[Eben Moglen]]:
: ''Defendant has infringed on the copyright of plaintiff by offering the software 'netfilter/iptables' for download and by advertising its distribution, without adhering to the license conditions of the GPL. Said actions would only be permissible if defendant had a license grant... This is independent of the questions whether the licensing conditions of the GPL have been effectively agreed upon between plaintiff and defendant or not. If the GPL were not agreed upon by the parties, defendant would notwithstanding lack the necessary rights to copy, distribute, and make the software 'netfilter/iptables' publicly available.''
This ruling was important because it was the first time that a court had confirmed that violating terms of the GPL was an act of copyright violation. However, the case was not as crucial a test for the GPL as some have concluded. In the case, the enforceability of GPL itself was not under attack. Instead, the court was merely attempting to discern if the license itself was in effect.
In May of [[2005]], [[Wallace versus International Business Machines et al|Daniel Wallace]] filed suit against the [[Free Software Foundation]] (FSF) in the [[U.S. District Court for the Southern District of Indiana|Southern District of Indiana]], contending that the GPL is an illegal attempt to fix prices at zero. The suit was dismissed in March 2006, on the grounds that Wallace had failed to state a valid anti-trust claim; the court noted that "the GPL encourages, rather than discourages, free competition and the distribution of computer operating systems, the benefits of which directly pass to consumers." Wallace was denied the possibility of further amending his complaint, and was ordered to pay the FSF's legal expenses.
On September 8, 2005, Seoul Central District Court ruled that the GPL had no legal relevance to a case dealing with a [[trade secret]] derived from GPL-licensed work. The defendants argued that, since it is impossible to maintain a trade secret while complying with the GPL and distributing the work, they were not in breach of trade secret. The court found this argument to be without ground.
On September 6, 2006, the [[gpl-violations.org]] project prevailed in court litigation against D-Link Germany GmbH regarding D-Link's inappropriate and copyright infringing use of parts of the Linux Operating System Kernel. The judgment finally provided the on-record, legal precedent that the GPL is valid and legally binding, and that it will stand up in German court.
In late 2007, the developers of [[BusyBox]] and the [[Software Freedom Law Center]] embarked upon a program to gain GPL compliance from distributors of BusyBox in [[embedded system]]s, suing those who would not comply. These were claimed to be the first US uses of courts for enforcement of GPL obligations. ''See'' [[BusyBox#GPL lawsuits]].
== Compatibility and multi-licensing==
Many of the most common free software licenses, such as the original [[MIT License|MIT/X license]], the [[BSD license]] (in its current 3-clause form), and the [[GNU Lesser General Public License|LGPL]], are "GPL-[[License compatibility|compatible]]". That is, their code can be combined with a program under the GPL without conflict (the new combination would have the GPL applied to the whole). However, some free/open source software licenses are not GPL-compatible. Many GPL proponents have strongly advocated that free/open source software developers use only GPL-compatible licenses, because doing otherwise makes it difficult to reuse software in larger wholes. Note that this issue only arises in concurrent use of licenses which impose conditions on their manner of combination. Some licenses, such as the BSD license, impose no conditions on the manner of their combination.
Also see the [[list of FSF approved software licenses]] for examples of compatible and incompatible licenses.
A number of businesses use [[dual-licensing]] to distribute a GPL version and sell a [[proprietary software|proprietary]] license to companies wishing to combine the package with proprietary code, using dynamic linking or not. Examples of such companies include [[MySQL AB]], [[Trolltech]] ([[Qt (toolkit)|Qt toolkit]]), [[Namesys]] ([[ReiserFS]]) and [[Red Hat]] ([[Cygwin]]).
== Adoption ==
The Open Source License Resource Center maintained by [[Black Duck Software]] shows that GPL is the license used in about 70% of all open source software. The vast majority of projects are released under GPL 2 with 3000 open source projects having migrated to GPL 3.
==Criticism==
In [[2001]] [[Microsoft]] [[CEO]] [[Steve Ballmer]] referred to Linux as "a cancer that attaches itself in an intellectual property sense to everything it touches." Critics of Microsoft claim that the real reason Microsoft dislikes the GPL is that the GPL resists proprietary vendors' attempts to "[[embrace, extend and extinguish]]". Microsoft has released [[Microsoft Windows Services for UNIX]] which contains GPL-licensed code. In response to Microsoft's attacks on the GPL, several prominent Free Software developers and advocates released a joint statement supporting the license.
The GPL has been described as being [[Copyleft#Is copyleft .22viral.22.3F|"viral"]] by many of its critics because the GPL only allows conveyance of whole programs, which means that programmers are not allowed to convey programs that [[GPL linking exception|link]] to libraries having GPL-incompatible licenses. The so-called "viral" effect of this is that under such circumstances disparately licensed software cannot be combined unless one of the licenses is changed. Although theoretically either license could be changed, in the "viral" scenario the GPL cannot be practically changed (because the software may have so many contributors, some of whom will likely refuse), whereas the license of the other software ''can'' be practically changed.
This is part of a [[BSD and GPL licensing|philosophical difference]] between the GPL and permissive free software licenses such as the [[BSD licenses|BSD-style licenses]], which do not put such a requirement on modified versions. While proponents of the GPL believe that free software should ensure that its freedoms are preserved all the way from the developer to the user, others believe that intermediaries between the developer and the user should be free to redistribute the software as non-free software. More specifically, the GPL requires that redistribution occur subject to the GPL, whereas more "permissive" licenses allow redistribution to occur under licenses more restrictive than the original license.
While the GPL does allow commercial distribution of GPL software, the market price will settle near the price of distribution—near zero—since the purchasers may redistribute the software and its source code for their cost of redistribution. This could be seen to inhibit commercial use of GPL'ed code by others wishing to use that code for proprietary purposes—if they don't wish to avail themselves of GPL'ed code, they will have to re-implement it themselves. Microsoft has included anti-GPL terms in their open source software.
In addition, the [[FreeBSD]] project has stated that "a less publicized and unintended use of the GPL is that it is very favorable to large companies that want to undercut software companies. In other words, the GPL is well suited for use as a marketing weapon, potentially reducing overall economic benefit and contributing to monopolistic behavior". It's not clear that there are any cases of this happening in practice, however.
The GPL has no [[Indemnity|indemnification]] clause explicitly protecting maintainers and developers from litigation resulting from unscrupulous contribution. (If a developer submits existing patented or copyrighted work to a GPL project claiming it as their own contribution, all the project maintainers and even other developers can be held legally responsible for damages to the copyright or patent holder.) Lack of indemnification is one criticism that led Mozilla to create the [[Mozilla Public License]] rather than use the GPL or LGPL. However, Mozilla later relicensed their work under a GPL/LGPL/MPL triple license, due to problems with the GPL-incompatibility of the MPL.
Some software developers have found the extensive scope of the GPL to be too restrictive. For example, Bjørn Reese and Daniel Stenberg describe how the downstream effects of the GPL on later developers create a "quodque pro quo" (Latin, "Everything in return for something"). For that reason, in 2001 they abandoned the GPLv2 in favor of less restrictive copyleft licenses.
A more specific example of the downstream effects of the GPL can be observed through the frame of incompatible licenses. Sun Microsystems' ZFS, because it is licensed under the GPL-incompatible CDDL and covered by several Sun patents, cannot link to the GPL-licensed Linux kernel.
Some have also argued that the GPL could, and should, be shorter.
Google
'''Google Inc.''' is an [[United States|American]] [[public company|public corporation]], earning revenue from [[AdWords|advertising]] related to its [[Google search|Internet search]], [[Gmail|web-based e-mail]], [[Google Maps|online mapping]], [[Google Apps|office productivity]], [[Orkut|social networking]], and [[YouTube|video sharing]] services as well as selling advertising-free versions of the [[Google Search Appliance|same technologies]]. Google's headquarters, the [[Googleplex]], is located in [[Mountain View, California]]. As of [[June 30]], [[2008]], the company has 19,604 full-time employees. As of [[October 31]], [[2007]], it is the largest American company (by [[market capitalization]]) that is not part of the [[Dow Jones Industrial Average]].
Google was co-founded by [[Larry Page]] and [[Sergey Brin]] while they were students at [[Stanford University]] and the company was first incorporated as a [[privately held company]] on [[September 7]], [[1998]]. Google's [[initial public offering]] took place on [[August 19]], [[2004]], raising [[United States dollar|US$]]1.67 billion, making it worth US$23 billion. Google has continued its growth through a series of new product developments, [[List of Google acquisitions|acquisitions]], and [[Google#Partnerships|partnerships]]. [[Google#Environmentalism|Environmentalism]], [[Google.org|philanthropy]], and [[Google#Corporate affairs and culture|positive employee relations]] have been important tenets during Google's growth, the latter resulting in the company being identified multiple times as [[Fortune Magazine|Fortune Magazine's]] #1 Best Place to Work. The company's unofficial slogan is "[[Don't be evil]]", although [[criticism of Google]] includes concerns regarding the [[privacy]] of personal information, [[copyright]], [[censorship by Google|censorship]], and discontinuation of services.
==History==
Google began in January 1996 as a research project by [[Larry Page]], who was soon joined by [[Sergey Brin]]; both were [[Doctor of Philosophy|Ph.D.]] students at [[Stanford University]] in [[California]]. They hypothesized that a search engine that analyzed the relationships between websites would produce better ranking of results than existing techniques, which ranked results according to the number of times the search term appeared on a page. Their search engine was originally nicknamed "BackRub" because the system checked [[backlinks]] to estimate a site's importance. A small search engine called Rankdex was already exploring a similar strategy.
Convinced that the pages with the most links to them from other highly relevant web pages must be the most relevant pages associated with the search, Page and Brin tested their thesis as part of their studies, and laid the foundation for their search engine. Originally, the search engine used the [[Stanford University]] website with the domain ''google.stanford.edu''. The domain ''google.com'' was registered on [[September 15]], [[1997]], and the company was incorporated as ''Google Inc.'' on [[September 7]], [[1998]] at a friend's garage in [[Menlo Park, California]]. The total initial investment raised for the new company amounted to almost US$1.1 million, including a US$100,000 check by [[Andy Bechtolsheim]], one of the founders of [[Sun Microsystems]].
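The idea that a page matters more when other important pages link to it can be illustrated with a short power-iteration sketch. This is a simplified, textbook-style illustration and not Google's actual implementation; the function name <code>rank_pages</code>, the damping value of 0.85, the iteration count, and the four-page toy web are all assumptions chosen for the example.

```python
def rank_pages(links, damping=0.85, iterations=50):
    """Score pages by the links pointing at them (simplified power iteration).

    `links` maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    scores = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page keeps a small baseline score regardless of its backlinks.
        new_scores = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its score on in equal shares to the pages it links to.
                share = damping * scores[page] / len(targets)
                for target in targets:
                    new_scores[target] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                share = damping * scores[page] / n
                for other in pages:
                    new_scores[other] += share
        scores = new_scores
    return scores


# Hypothetical four-page web: three pages link to "hub", which links back to "a".
toy_web = {
    "a": ["hub"],
    "b": ["hub"],
    "c": ["hub"],
    "hub": ["a"],
}
scores = rank_pages(toy_web)
best = max(scores, key=scores.get)
```

In this toy web, "hub" receives the highest score because it has the most backlinks, and "a" outranks "b" and "c" because its single backlink comes from the highly ranked "hub".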
In March 1999, the company moved into offices in [[Palo Alto, California|Palo Alto]], home to several other noted [[Silicon Valley]] technology startups. After quickly outgrowing two other sites, the company leased a complex of buildings in [[Mountain View, Santa Clara County, California|Mountain View]] at 1600 Amphitheatre Parkway from [[Silicon Graphics]] (SGI) in 2003. The company has remained at this location ever since, and the complex has since come to be known as the [[Googleplex]] (a play on the word [[googolplex]]). In 2006, Google bought the property from SGI for US$319 million.
The Google search engine attracted a loyal following among the growing number of Internet users, who liked its simple design and usability. In 2000, Google began selling [[advertising|advertisements]] associated with search [[keyword (internet search)|keywords]]. The ads were text-based to maintain an uncluttered page design and to maximize page loading speed. Keywords were sold based on a combination of price bid and clickthroughs, with bidding starting at US$0.05 per click. This model of selling keyword advertising was pioneered by [[Yahoo! Search Marketing|Goto.com]] (later renamed Overture Services, before being acquired by [[Yahoo!]] and rebranded as [[Yahoo! Search Marketing]]). While many of its [[dot-com]] rivals failed in the new Internet marketplace, Google quietly rose in stature while generating revenue.
The name "Google" originated from a common misspelling of the word "[[googol]]", which refers to 10<sup>100</sup>, the number represented by a 1 followed by one hundred zeros. Having found its way increasingly into everyday language, the verb "[[google (verb)|google]]" was added to the ''[[Merriam-Webster|Merriam Webster Collegiate Dictionary]]'' and the ''[[Oxford English Dictionary]]'' in 2006, meaning "to use the Google search engine to obtain information on the Internet."
A [[patent]] describing part of Google's ranking mechanism ([[PageRank]]) was granted on [[September 4]], [[2001]]. The patent was officially assigned to Stanford University and lists Lawrence Page as the inventor.
===Financing and initial public offering===
The first funding for Google as a company was secured in 1998, in the form of a US$100,000 contribution from [[Andy Bechtolsheim]], co-founder of [[Sun Microsystems]], given to a corporation which did not yet exist. Around six months later, a much larger round of funding was announced, with the major investors being rival venture capital firms [[Kleiner Perkins Caufield & Byers]] and [[Sequoia Capital]].
Google's [[IPO]] took place on [[August 19]], [[2004]]. 19,605,052 [[stock|shares]] were offered at a price of US$85 per share. Of those, 14,142,135 (another mathematical reference as [[square root of two|√2]] ≈ 1.4142135) were floated by Google, and the remaining 5,462,917 were offered by existing stockholders. The sale of US$1.67 billion gave Google a [[market capitalization]] of more than US$23 billion. The vast majority of Google's 271 million shares remained under Google's control. Many of Google's employees became instant [[paper millionaires]]. [[Yahoo!]], a competitor of Google, also benefited from the IPO because it owned 8.4 million shares of Google as of [[August 9]], [[2004]], ten days before the IPO.
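The IPO figures above can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers stated in the text; the variable names are illustrative, and the market capitalization line treats "271 million" as a rounded total share count, which is an assumption.

```python
# Figures taken from the paragraph above.
SHARES_FROM_GOOGLE = 14_142_135
SHARES_FROM_STOCKHOLDERS = 5_462_917
PRICE_PER_SHARE_USD = 85
TOTAL_SHARES_ROUNDED = 271_000_000  # "271 million", assumed to be a rounded figure

# Shares floated by Google plus shares offered by existing stockholders.
shares_offered = SHARES_FROM_GOOGLE + SHARES_FROM_STOCKHOLDERS

# Gross value of the sale at the offer price.
gross_proceeds_usd = shares_offered * PRICE_PER_SHARE_USD

# Value of all shares at the offer price.
implied_market_cap_usd = TOTAL_SHARES_ROUNDED * PRICE_PER_SHARE_USD

print(shares_offered)          # 19605052
print(gross_proceeds_usd)      # 1666429420, about US$1.67 billion
print(implied_market_cap_usd)  # 23035000000, a little over US$23 billion
```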
Google's stock has performed strongly since its IPO, with shares hitting US$700 for the first time on [[October 31]], [[2007]], due to strong sales and earnings in the advertising market, as well as the release of new features such as the [[Google Desktop|desktop search function]] and its iGoogle personalized home page. The surge in stock price is fueled primarily by individual investors, as opposed to large institutional investors and [[mutual fund]]s.
The company is listed on the [[NASDAQ]] stock exchange under the [[ticker]] symbol '''GOOG''' and on the [[London Stock Exchange]] under the ticker symbol '''GGEA'''.
===Growth===
While the company's primary business interest is in the web content arena, Google has begun experimenting with other markets, such as [[radio]] and print publications. On [[January 17]], [[2006]], Google announced its purchase of the radio advertising company dMarc, which provides an automated system that allows companies to advertise on the radio. This will allow Google to combine two niche advertising media, the Internet and radio, with Google's ability to laser-focus on the tastes of consumers. Google has also begun an experiment in selling advertisements from its advertisers in offline newspapers and magazines, with select advertisements in the [[Chicago Sun-Times]]. They have been filling unsold space in the newspaper that would have normally been used for in-house advertisements.
Google was added to the [[S&P 500 index]] on [[March 30]], [[2006]]. It replaced [[Burlington Resources]], a major oil producer based in [[Houston]] which was acquired by [[ConocoPhillips]].
===Acquisitions===
Since 2001, Google has acquired several small start-up companies, often consisting of innovative teams and products. One of the earlier companies that Google bought was [[Pyra Labs]], the creators of [[Blogger (service)|Blogger]], a weblog publishing platform first launched in 1999. This acquisition led to many premium features becoming free. Pyra Labs was originally formed by [[Evan Williams (blogger)|Evan Williams]], who left Google in 2004. In early 2006, Google acquired Upstartle, the company responsible for the online word processor [[Writely]]. Google eventually used the technology in this product to create [[Google Docs & Spreadsheets]].
In 2004, Google acquired a company called [[Keyhole, Inc.]], which had developed a product called ''Earth Viewer''; the product was renamed [[Google Earth]] in 2005.
In February 2006, software company Adaptive Path sold Measure Map, a [[weblog]] statistics application, to Google. Registration to the service has since been temporarily disabled. The last update regarding the future of Measure Map was made on [[April 6]], [[2006]] and outlined many of the service's known issues.
In late 2006, Google bought online video site [[YouTube]] for US$1.65 billion in stock. Shortly after, on [[October 31]], [[2006]], Google announced that it had also acquired [[JotSpot]], a developer of wiki technology for collaborative Web sites.
On [[April 13]], [[2007]], Google reached an agreement to acquire [[DoubleClick]]. Google agreed to buy the company for US$3.1 billion.
On [[July 9]], [[2007]], Google announced that it had signed a definitive agreement to acquire enterprise messaging security and compliance company [[Postini]].
===Partnerships===
In 2005, Google entered into partnerships with other companies and government agencies to improve production and services. Google announced a partnership with [[NASA Ames Research Center]] to build offices and work on research projects involving large-scale data management, [[nanotechnology]], [[distributed computing]], and the entrepreneurial space industry. Google also entered into a partnership with [[Sun Microsystems]] in October to help share and distribute each other's technologies. The company entered into a partnership with [[Time Warner]]'s [[AOL]] to enhance each other's video search services.
The same year, the company became a major financial investor of the new [[.mobi]] [[top-level domain]] for mobile devices, in conjunction with several other companies, including [[Microsoft]], [[Nokia]], and [[Ericsson]]. In September 2007, Google launched "AdSense for Mobile", a service for its publishing partners that provides the ability to monetize their mobile websites through the targeted placement of mobile text ads, and acquired the mobile social networking site ''Zingku.mobi'' to "provide people worldwide with direct access to Google applications, and ultimately the information they want and need, right from their mobile devices."
In 2006, Google and [[News Corporation|News Corp.]]'s Fox Interactive Media entered into a US$900 million agreement to provide search and advertising on the popular social networking site, [[MySpace]].
On November 5, 2007, Google announced the [[Open Handset Alliance]] to develop an open platform for mobile services called [[Google Android|Android]].
In March 2008, Google, [[Sprint]], [[Intel]], [[Comcast]], [[Time Warner Cable]], [[Bright House Networks]], and [[Clearwire]] together founded [[Xohm]] to provide wireless [[telecommunication]] service.
==Products and services==
Google has created services and tools for the general public and business environment alike, including Web applications, advertising networks and solutions for businesses.
===Advertising===
Most of Google's revenue is derived from advertising programs. For the 2006 fiscal year, the company reported US$10.492 billion in total advertising revenues and only US$112 million in licensing and other revenues. Google [[AdWords]] allows Web advertisers to display advertisements in Google's search results and the Google Content Network, through either a cost-per-click or cost-per-view scheme. Through Google [[AdSense]], website owners can also display adverts on their own sites and earn money every time ads are clicked.
===Web-based software===
The [[Google search|Google web search engine]] is the company's most popular service. As of August 2007, Google is the most used [[search engine]] on the web with a 53.6% market share, ahead of [[Yahoo!]] (19.9%) and [[Live Search]] (12.9%). Google indexes billions of Web pages, so that users can search for the information they desire through the use of [[keyword (Internet search)|keywords]] and [[operators]]. Google has also applied its web search technology to other search services, including Image Search, [[Google News]], the price comparison site [[Google Product Search]], the interactive [[Usenet]] archive [[Google Groups]], [[Google Maps]], and more.
In 2004, Google launched its own free web-based e-mail service, known as [[Gmail]] (or Google Mail in some jurisdictions). Gmail features [[e-mail filtering|spam-filtering technology]] and the capability to use Google technology to search e-mail. The service generates revenue by displaying advertisements and links from the [[AdWords]] service that are tailored to the choice of the user and/or content of the e-mail messages displayed on screen.
In early 2006, the company launched [[Google Video]], which not only allows users to search and view freely available videos but also offers users and media publishers the ability to publish their content, including television shows on [[CBS]], [[NBA]] basketball games, and music videos. In August 2007, Google announced that it would shut down its video rental and sale program and offer refunds and [[Google Checkout]] credits to consumers who had purchased videos to own.
On [[February 28]], [[2008]], Google launched the [[Google Sites]] [[wiki]] as a [[Google Apps]] component.
Google has also developed several desktop applications, including [[Google Earth]], an interactive mapping program powered by satellite and aerial imagery that covers the vast majority of the planet. Google Earth is generally considered to be remarkably accurate and extremely detailed. Many major cities have such detailed images that one can zoom in close enough to see vehicles and pedestrians clearly. Consequently, there have been some concerns about national security implications. Specifically, some countries and militaries contend the software can be used to pinpoint, with a high degree of accuracy, the physical location of critical infrastructure, commercial and residential buildings, bases, government agencies, and so on. However, the satellite images are not necessarily frequently updated, and all of them are available at no charge through other products and even government sources, such as [[NASA]] and the [[NGA|National Geospatial-Intelligence Agency]]. Some counter this argument by stating that Google Earth makes it easier to access and research the images.
Many other products are available through [[Google Labs]], which is a collection of incomplete applications that are still being tested for use by the general public.
Google has promoted its products in various ways. In [[London]], ''Google Space'' was set up in [[Heathrow Airport]], showcasing several products, including Gmail, Google Earth and Picasa. A similar page was also launched for American college students, under the name ''College Life, Powered by Google.''
In 2007, some reports surfaced that Google was planning the release of its own mobile phone, possibly a competitor to [[Apple Inc.|Apple]]'s [[iPhone]]. The project, called [[Android (mobile phone platform)|Android]], provides a standard development kit that will allow any "Android" phone to run software developed for the Android SDK, no matter the phone manufacturer. In October 2007, Google SMS service was launched in [[India]], allowing users to get business listings, movie showtimes, and information by sending an [[SMS]].
===Enterprise products===
In 2007, Google launched [[Google Apps|Google Apps Premier Edition]], a version of Google Apps targeted primarily at the business user. It includes such extras as more disk space for e-mail, API access, and premium support, for a price of US$50 per user per year. A large implementation of Google Apps with 38,000 users is at [[Lakehead University]] in [[Thunder Bay, Ontario|Thunder Bay]], Ontario, Canada.
==Platform==
Google runs its services on several [[server farm]]s, each comprising thousands of low-cost commodity computers running stripped-down versions of [[Linux]]. While the company divulges no details of its hardware, a 2006 estimate cites 450,000 servers, "racked up in clusters at data centers around the world."
==Corporate affairs and culture==
Google is known for its relaxed corporate culture, of which its playful variations on [[Google logo#History of the Google Doodle|its own corporate logo]] are an indicator. In 2007 and 2008, ''[[Fortune Magazine]]'' placed Google at the top of its list of the hundred best places to work. Google's corporate philosophy embodies such casual principles as "you can make money without doing evil," "you can be serious without a suit," and "work should be challenging and the challenge should be fun."
Google has been criticized for having salaries below industry standards. For example, some [[system administrator]]s earn no more than US$35,000 per year – considered to be quite low for the [[San Francisco Bay Area|Bay Area]] job market. However, Google's stock performance following its [[Initial public offering|IPO]] has enabled many early employees to be competitively compensated by participation in the corporation's remarkable equity growth. Google implemented other employee incentives in 2005, such as the [[Google Founders' Award]], in addition to offering higher salaries to new employees. Google's workplace amenities, culture, global popularity, and strong brand recognition have also attracted potential applicants.
After the company's [[IPO]] in August 2004, it was reported that founders [[Sergey Brin]] and [[Larry Page]], and CEO [[Eric E. Schmidt|Eric Schmidt]], requested that their base salary be cut to US$1.00. Subsequent offers by the company to increase their salaries have been turned down, primarily because, "their primary compensation continues to come from returns on their ownership stakes in Google. As significant stockholders, their personal wealth is tied directly to sustained stock price appreciation and performance, which provides direct alignment with stockholder interests." Prior to 2004, Schmidt was making US$250,000 per year, and Page and Brin each earned a salary of US$150,000.
They have all declined recent offers of bonuses and increases in compensation by Google's board of directors. In a 2007 report of the United States' richest people, [[Forbes]] reported that [[Sergey Brin]] and [[Larry Page]] were tied for #5 with a net worth of US$18.5 billion each.
In 2007 and through early 2008, Google saw the departure of several top executives. Justin Rosenstein, Google's product manager, left in June 2007. Shortly thereafter, Gideon Yu, former chief financial officer of [[YouTube]], a Google unit, joined [[Facebook]] along with Benjamin Ling, a high-ranking engineer, who left in October 2007. In March 2008, two senior Google leaders announced their desire to pursue other opportunities. Sheryl Sandberg, ex-VP of global online sales and operations, began her position as COO of [[Facebook]], while Ash ElDifrawi, former head of brand advertising, left to become CMO of [[Netshops]] Inc.
===Googleplex===
Google's headquarters in Mountain View, California, is referred to as "the [[Googleplex]]" in a play on words: a [[googolplex]] is 1 followed by a googol zeros, and the HQ is a [[complex]] of buildings (cf. [[movie theater|multiplex]], cineplex, etc). The lobby is decorated with a [[piano]], [[lava lamps]], old server clusters, and a projection of search queries on the wall. The hallways are full of exercise balls and [[bicycle]]s. Each employee has access to the corporate recreation center. Recreational amenities are scattered throughout the campus and include a workout room with weights and rowing machines, locker rooms, washers and dryers, a massage room, assorted [[video game]]s, [[Foosball]], a [[piano|baby grand piano]], a pool table, and [[ping pong]]. In addition to the [[Recreation room|rec room]], there are snack rooms stocked with various foods and drinks.
In 2006, Google moved into office space in [[New York City]], at 111 [[Eighth Avenue|Eighth Ave.]] in Manhattan. The office was specially designed and built for Google and houses its largest advertising sales team, which has been instrumental in securing large partnerships, most recently deals with [[MySpace]] and [[AOL]]. In 2003, they added an engineering staff in New York City, which has been responsible for more than 100 engineering projects, including [[Google Maps]], [[Google Spreadsheet]]s, and others. It is estimated that the building costs Google US$10 million per year to rent and is similar in design and functionality to its [[Mountain View, California|Mountain View]] headquarters, including [[foosball]], [[air hockey]], and ping-pong tables, as well as a video game area. In November 2006, Google opened offices on [[Carnegie Mellon]]'s campus in [[Pittsburgh, Pennsylvania|Pittsburgh]]. By late 2006, Google also established a new headquarters for its AdWords division in [[Ann Arbor, Michigan]].
The size of Google's search system is presently undisclosed. The best estimates place the total number of the company's servers at 450,000, spread over twenty-five locations throughout the world, including major [[network operations center|operations centers]] in [[Dublin]] (European Operations [[Headquarters]]) and [[Atlanta, Georgia]]. Google is also in the process of constructing a major operations center in [[The Dalles, Oregon]], on the banks of the [[Columbia River]]. The site, also referred to by the media as ''Project 02'', was chosen due to the availability of inexpensive [[hydroelectric power]] and a large surplus of [[fiber optic]] cable, remnants of the dot com boom of the late 1990s. The computing center is estimated to be the size of two [[American football|football fields]], and it has created hundreds of construction jobs, causing local real estate prices to increase 40%. Upon completion, the center is expected to create 60 to 200 permanent jobs in the town of 12,000 people.
Google is taking steps to ensure that their operations are environmentally sound. In October 2006, the company announced plans to install thousands of [[Photovoltaic module|solar panels]] to provide up to 1.6 [[megawatt]]s of [[electricity]], enough to satisfy approximately 30% of the campus' energy needs. The system will be the largest solar power system constructed on a [[United States|U.S.]] corporate campus and one of the largest on any corporate site in the world. In June 2007, Google announced that they plan to become [[carbon neutral]] by 2008, which includes investing in energy efficiency, renewable energy sources, and purchasing carbon offsets, such as investing in projects like capturing and burning [[methane]] from animal waste at Mexican and Brazilian farms.
===Innovation time off===
As an interesting motivation technique (usually called [[ITO|Innovation Time Off]]), all Google engineers are encouraged to spend 20% of their work time (one day per week) on projects that interest them. Some of Google's newer services, such as [[Gmail]], [[Google News]], [[Orkut]], and [[AdSense]] originated from these independent endeavors. In a talk at [[Stanford University]], [[Marissa Mayer]], Google's Vice President of Search Products and User Experience, stated that her analysis showed that half of the new product launches originated from the 20% time.
===Easter eggs and April Fool's Day jokes===
Google has a tradition of creating [[April Fool's Day]] jokes—such as [[Google's hoaxes#2000|Google MentalPlex]], which allegedly featured the use of mental power to search the web. In 2002, they claimed that [[pigeons]] were the [[Google's hoaxes#2002: Pigeon Rank|secret]] behind their growing [[search engine]]. In 2004, they featured [[Google's hoaxes#2004: Google Lunar/Copernicus Center|Google Lunar]] (which claimed to feature jobs on the [[moon]]), and in 2005, a [[fiction|fictitious]] brain-boosting drink, termed [[Google's hoaxes#2005: Google Gulp|Google Gulp]] was announced. In 2006, they came up with [[Google's hoaxes#2006: Google Romance|Google Romance]], a hypothetical [[online dating]] service. In 2007, Google announced two joke products. The first was a free wireless Internet service called [[TiSP]] (Toilet Internet Service Provider) in which one obtained a connection by flushing one end of a [[fiber-optic]] cable down their toilet and waiting only an hour for a "Plumbing Hardware Dispatcher (PHD)" to connect it to the Internet. Additionally, Google's [[Gmail]] page displayed an announcement for [[Gmail Paper]], which allows users of their free email service to have email messages printed and shipped to a snail mail address.
Google's services contain a number of [[Easter egg (virtual)|Easter eggs]]; for instance, the Language Tools page offers the search interface in the [[Swedish Chef]]'s "Bork bork bork," [[Pig Latin]], "Hacker" (actually [[leetspeak]]), [[Elmer Fudd]], and [[Klingon language|Klingon]]. In addition, the search engine calculator provides the [[Answer to Life, the Universe, and Everything]] from [[Douglas Adams]]' ''[[The Hitchhiker's Guide to the Galaxy]]''. As Google's search box can be used as a unit converter (as well as a calculator), some non-standard units are built in, such as the [[Smoot]]. Google also routinely modifies its logo in accordance with various holidays or special events throughout the year, such as [[Christmas]], [[Mother's Day]], or the [[birthday]]s of various notable individuals.
===IPO and culture===
Many people speculated that Google's [[initial public offering|IPO]] would inevitably lead to changes in the company's culture, because of shareholder pressure for employee benefit reductions and short-term advances, or because a large number of the company's employees would suddenly become millionaires on paper. In a report given to potential investors, co-founders Sergey Brin and Larry Page promised that the IPO would not change the company's culture. Later Mr. Page said, "We think a lot about how to maintain our culture and the fun elements. We spent a lot of time getting our offices right. We think it's important to have a high density of people. People are packed together everywhere. We all share offices. We like this set of buildings because it's more like a densely packed university campus than a typical suburban office park."
However, many analysts are finding that as Google grows, the company is becoming more "corporate". In 2005, articles in ''[[The New York Times]]'' and other sources began suggesting that Google had lost its anti-corporate, "no evil" philosophy.
In an effort to maintain the company's unique culture, Google designated a Chief Culture Officer in 2006, who also serves as the Director of Human Resources. The purpose of the Chief Culture Officer is to develop and maintain the culture and work on ways to keep true to the core values on which the company was founded: a flat organization, a lack of hierarchy, and a collaborative environment.
===Philanthropy===
In 2004, Google formed a for-profit philanthropic wing, [[Google.org]], with a start-up fund of US$1 billion. The express mission of the organization is to create awareness about [[climate change]], global public health, and [[global poverty]]. One of its first projects is to develop a viable [[plug-in hybrid]] [[electric vehicle]] that can attain 100 [[fuel economy in automobiles|mpg]]. The founding and current director is Dr. [[Larry Brilliant]].
==Criticism==
As it has grown, Google has found itself the focus of several controversies related to its business practices and services. For example, [[Google Book Search]]'s effort to digitize millions of books and make the full text searchable has led to [[copyright]] disputes with the [[Authors Guild]]. Google's cooperation with the governments of [[People's Republic of China|China]], and to a lesser extent [[France]] and [[Germany]] (regarding [[Holocaust denial]]), to filter search results in accordance with regional laws and regulations has led to claims of [[censorship by Google|censorship]]. Google's persistent [[HTTP cookie|cookie]] and other information collection practices have led to concerns over user [[Google and privacy issues|privacy]]. As of [[December 11]], [[2007]], Google, like the [[Microsoft]] search engine, stores "personal information for 18 months" and by comparison, [[Yahoo!]] and [[AOL]] ([[Time Warner]]) "retain search requests for 13 months."
A number of [[India]]n state governments have raised concerns about the security risks posed by geographic details provided by [[Google Earth]]'s satellite imaging. Google has also been criticized by advertisers regarding its inability to combat [[click fraud]], in which a person or automated script is used to generate a charge on an advertisement without having a real interest in the product. Industry reports in 2006 claimed that approximately 14 to 20 percent of clicks were in fact fraudulent or invalid. Further, Google has faced allegations of [[sexism]] and [[ageism]] from former employees. Google has also faced accusations in [[Harper's Magazine]] of excessive energy usage, and of employing its "[[Don't be evil]]" motto and its very public energy-saving campaigns as a means of covering up or making up for the massive amounts of energy its servers actually require.
Also, US District Court Judge [[Louis Stanton]], on [[July 1]], 2008, ordered Google to give [[YouTube]] user data and logs to [[Viacom]] to support its case in a billion-dollar [[copyright]] lawsuit against Google. On [[July 14]], 2008, however, Google and [[Viacom]] agreed to a [[compromise]] to protect [[YouTube]] users' personal data in the US$1 billion (£497 million) copyright lawsuit. Google agreed to make user information and internet protocol addresses from its YouTube subsidiary anonymous before handing over the data to Viacom. The privacy deal also applied to other litigants, including the [[FA Premier League]], the Rodgers & Hammerstein Organisation and the [[Scottish Premier League]]. The deal did not, however, extend the anonymity to employees, since Viacom wished to prove that Google staff were aware of the uploading of illegal material to the site. The parties will therefore meet further on the matter, lest the data be made available to the court.
Google Translate
'''Google Translate''' is a service provided by [[Google|Google Inc.]] to translate a section of text, or a webpage, into another language, with limits to the number of paragraphs, or range of technical terms, translated. For some languages, users are asked for alternate translations, such as for technical terms, to be included for future updates to the translation process.
Unlike other translation services such as [[Babel Fish (website)|Babel Fish]], [[AOL]], and [[Yahoo!|Yahoo]] which use [[SYSTRAN]], Google uses its own translation software.
== Functions ==
The service also includes translation of an entire Web page. The translation is limited in number of paragraphs per webpage (such as indicated by break-tags <br>); however, if text on a webpage is separated by horizontal blank-line images (auto-wrapped without using any <br>), a long webpage can be translated containing several thousand words.
Google Translate, like other automatic translation tools, has its limitations. While it can help the reader to understand the general content of a foreign language text, it does not deliver accurate translations and does not produce publication-standard content. For example, it often translates words out of context and deliberately does not apply any [[Grammar|grammatical]] rules.
== Approach ==
Google Translate is based on an approach called [[statistical machine translation]], and more specifically, on research by [[Franz-Josef Och]], who won the [[DARPA]] contest for speed machine translation in 2003. Och is now the head of Google's machine translation department.
According to Och, a solid base for developing a usable statistical machine translation system for a new pair of languages from scratch would consist of a bilingual [[text corpus]] (or [[parallel text|parallel collection]]) of more than a million words and two monolingual corpora, each of more than a billion words. Statistical [[Mathematical model|models]] derived from this data are then used to translate between those languages.
To acquire this huge amount of linguistic data, Google used [[United Nations]] documents. The same document is normally available in all six official UN languages, so Google now has a six-language corpus of 20 billion words' worth of human translations.
The availability of Arabic and Chinese as official UN languages is probably one of the reasons why Google Translate initially focused on the development of translation between English and those languages, and not, for example, [[Japanese language|Japanese]] and [[German language|German]], which are not official languages at the UN.
Google representatives have been very active at domestic conferences in Japan in the field, asking researchers to provide them with bilingual corpora.
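The roles of the bilingual and monolingual corpora described above can be illustrated with the classical noisy-channel formulation of statistical machine translation. This is a textbook sketch, not necessarily the exact model used by Google Translate; the symbols ''f'' (source sentence), ''e'' (candidate translation) and ''ê'' (chosen translation) are notation introduced here for illustration only:

<math>\hat{e} = \arg\max_{e} P(e \mid f) = \arg\max_{e} P(f \mid e)\,P(e)</math>

Under this assumption, the translation model ''P''(''f'' | ''e'') would be estimated from the bilingual parallel corpus, while the language model ''P''(''e'') would be estimated from the monolingual corpus of the target language.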
== Options ==
(in chronological order)
*Beginning
**English to Arabic
**English to French
**English to German
**English to Spanish
**French to English
**German to English
**Spanish to English
**Arabic to English
*2nd stage
**English to Portuguese
**Portuguese to English
*3rd stage
**English to Italian
**Italian to English
*4th stage
**English to Chinese (Simplified) BETA
**English to Japanese BETA
**English to Korean BETA
**Chinese (Simplified) to English BETA
**Japanese to English BETA
**Korean to English BETA
*5th stage
**English to Russian BETA
**Russian to English BETA
*6th stage
**English to Arabic BETA
**Arabic to English BETA
*7th stage (launched February, 2007)
**English to Chinese (Traditional) BETA
**Chinese (Traditional) to English BETA
**Chinese (Simplified to Traditional) BETA
**Chinese (Traditional to Simplified) BETA
*8th stage (launched October, 2007)
** all 25 language pairs use Google's machine translation system
*9th stage
**English to Hindi BETA
**Hindi to English BETA
*10th stage (as of this stage, translation can be done between any two languages)
**Bulgarian
**Croatian
**Czech
**Danish
**Dutch
**Finnish
**Greek
**Norwegian
**Polish
**Romanian
**Swedish
Grammar
'''Grammar''' is the field of [[linguistics]] that covers the [[rules]] governing the use of any given [[natural language|natural language]]. It includes [[morphology (linguistics)|morphology]] and [[syntax]], often complemented by [[phonetics]], [[phonology]], [[semantics]], and [[pragmatics]].
Each language has its own distinct grammar. "English grammar" is the rules of the English language itself. "''An'' English grammar" is a specific study or analysis of these rules. A [[reference book]] describing the grammar of a language is called a "reference grammar" or simply "a grammar". A fully explicit grammar exhaustively describing the [[grammaticality|grammatical]] constructions of a language is called a descriptive grammar, as opposed to [[linguistic prescription]], which tries to enforce rules for how a language is to be used. [[Grammatical framework]]s are approaches to constructing grammars. The standard framework of [[generative grammar]] is the [[transformational grammar]] model developed by [[Noam Chomsky]] and his followers from the 1950s to 1980s.
==Etymology==
The word "grammar" derives from [[Greek language|Greek]] ''γραμματική τέχνη'' (''grammatike techne''), which means "art of letters," from ''γράμμα'' (''gramma''), "letter," and that from ''γράφειν'' (''graphein''), "to draw, to write".
==History==
The first systematic grammars originate in [[Iron Age India]], with [[Panini (grammarian)|Panini]] (4th c. BC) and his commentators [[Pingala]] (ca. 200 BC), [[Katyayana]], and [[Patanjali]] (2nd c. BC). In the West, grammar emerges as a discipline in [[Hellenism]] from the 3rd c. BC forward with authors like [[Rhyanus]] and [[Aristarchus of Samothrace]], the oldest extant work being the ''[[Art of Grammar]]'' ({{lang|grc|Τέχνη Γραμματική}}), attributed to [[Dionysius Thrax]] (ca. 100 BC). [[Latin grammar]] developed by following Greek models from the 1st century BC, due to the work of authors such as [[Orbilius Pupillus]], [[Remmius Palaemon]], [[Marcus Valerius Probus]], [[Verrius Flaccus]], [[Aemilius Asper]].
Tamil grammatical tradition also began around the 1st century BC with the [[Tolkāppiyam]].
A grammar of [[Old Irish|Irish]] originated in the 7th century with the [[Auraicept na n-Éces]]. [[Arabic grammar]] emerges from the 8th century with the work of [[Ibn Abi Ishaq]] and his students.
The first treatises on [[Hebrew grammar]] appear in the [[High Middle Ages]], in the context of [[Mishnah]] (exegesis of the [[Hebrew Bible]]). The [[Karaite]] tradition originates in [[Abbasid]] [[Baghdad]]. The ''[[Diqduq]]'' (10th century) is one of the earliest grammatical commentaries on the Hebrew Bible. [[Ibn Barun]] in the 12th century compares the Hebrew language with [[Arabic language|Arabic]] in the [[Islamic grammatical tradition]].
Belonging to the ''trivium'' of the seven [[liberal arts]], grammar was taught as a core discipline throughout the [[Middle Ages]], following the influence of authors from [[Late Antiquity]], such as [[Priscian]]. Treatment of vernaculars begins gradually during the [[High Middle Ages]], with isolated works such as the [[First Grammatical Treatise]], but becomes influential only in the [[Renaissance]] and [[Baroque]] periods. In [[1486]], [[Antonio de Nebrija]] published ''Las introduciones Latinas contrapuesto el romance al Latin'', and the first [[Spanish grammar]], ''Gramática de la lengua castellana'', in 1492. During the 16th century [[Italian Renaissance]], the ''Questione della lingua'' was the discussion on the status and ideal form of the [[Italian language]], initiated by [[Dante]]'s ''[[de vulgari eloquentia]]'' ([[Pietro Bembo]], ''Prose della volgar lingua'' Venice 1525).
Grammars of non-European languages began to be compiled for the purposes of [[evangelization]] and [[Bible translation]] from the 16th century onward, such as ''Grammatica o Arte de la Lengua General de los Indios de los Reynos del Perú'' (1560), a [[Quechua]] grammar by [[Fray Domingo de Santo Tomás]]. In 1643 there appeared [[Ivan Uzhevych]]'s ''Grammatica sclavonica'' and, in 1762, the ''Short Introduction to English Grammar'' of [[Robert Lowth]] was published. The ''Grammatisch-Kritisches Wörterbuch der hochdeutschen Mundart'', a [[High German]] grammar in five volumes by [[Johann Christoph Adelung]], appeared as early as 1774.
From the latter part of the 18th century, grammar came to be understood as a subfield of the emerging discipline of modern [[linguistics]]. The Serbian grammar by [[Vuk Stefanović Karadžić]] arrived in 1814, while the ''Deutsche Grammatik'' of the [[Brothers Grimm]] was first published in 1818. The ''Comparative Grammar'' of [[Franz Bopp]], the starting point of modern [[comparative linguistics]], came out in 1833.
In the [[USA]], the Society for the Promotion of Good Grammar has designated March 4, 2008 as National Grammar Day.
==Development of grammars==
Grammars evolve through usage, and grammars also develop due to separations of the human population. With the advent of written [[Knowledge representation|representation]]s, formal rules about language usage tend to appear also. Formal grammars are [[codification (linguistics)|codifications]] of usage that are developed by repeated documentation over time, and by [[observation]] as well. As the rules become established and developed, the prescriptive concept of grammatical correctness can arise. This often creates a discrepancy between contemporary usage and that which has been accepted over time as being correct. Linguists tend to believe that prescriptive grammars do not have any justification beyond their authors' aesthetic tastes; however, prescriptions are considered in [[sociolinguistics]] as part of the explanation for why some people say "I didn't do nothing", some say "I didn't do anything", and some say one or the other depending on social context.
The formal study of grammar is an important part of [[education]] for children from a young age through advanced [[learning]], though the rules taught in schools are not a "grammar" in the sense most [[linguistics|linguists]] use the term, as they are often [[prescriptive]] rather than [[descriptive]]. [[Constructed language]]s (also called planned languages or conlangs) are more common in the modern day. Many have been designed to aid human [[communication]] (for example, naturalistic [[Interlingua]], schematic [[Esperanto]], and the highly logic-compatible artificial language [[Lojban]]). Each of these languages has its own grammar.
No clear line can be drawn between syntax and morphology. [[Analytic languages]] use [[syntax]] to convey information that is encoded via [[inflection]] in [[synthetic language]]s. In other words, word order is not significant and [[morphology (linguistics)|morphology]] is highly significant in a purely synthetic language, whereas morphology is not significant and syntax is highly significant in an analytic language. [[Chinese language|Chinese]] and [[Afrikaans language|Afrikaans]], for example, are highly analytic, and meaning is therefore very context-dependent. (Both do have some inflections, and have had more in the past; thus, they are becoming even less synthetic and more "purely" analytic over time.) [[Latin]], which is highly [[synthetic language|synthetic]], uses [[affix]]es and [[inflection]]s to convey the same information that Chinese does with [[syntax]]. Because Latin words are quite (though not completely) self-contained, an intelligible Latin [[Sentence (linguistics)|sentence]] can be made from elements that are placed in a largely arbitrary order. Latin has complex affixation and simple syntax, while Chinese has the opposite.
==Grammar frameworks==
Various "grammar frameworks" have been developed in [[theoretical linguistics]] since the mid-20th century, in particular under the influence of the idea of a "[[Universal grammar]]" in the USA. Of these, the main divisions are:
*[[Transformational grammar]] (TG)
*[[Principles and Parameters|Principles and Parameters Theory]] (P&P)
*[[Lexical functional grammar|Lexical-functional Grammar]] (LFG)
*[[Generalised Phrase Structure Grammar|Generalized Phrase Structure Grammar]] (GPSG)
*[[Head-Driven Phrase Structure Grammar]] (HPSG)
*[[Dependency grammar]]s (DG)
*[[Role and reference grammar]] (RRG)
Hidden Markov model
A '''hidden Markov model''' ('''HMM''') is a [[statistical model]] in which the system being modeled is assumed to be a [[Markov process]] with unknown parameters, and the challenge is to determine the hidden parameters from the [[observable]] parameters. The extracted model parameters can then be used to perform further analysis, for example for [[pattern recognition]] applications. An HMM can be considered as the simplest [[dynamic Bayesian network]].
In a regular [[Markov model]], the state is directly visible to the observer, and therefore the state transition probabilities are the only parameters. In a ''hidden'' Markov model, the state is not directly visible, but variables influenced by the state are visible. Each state has a probability distribution over the possible output tokens. Therefore the sequence of tokens generated by an HMM gives some information about the sequence of states.
Hidden Markov models are especially known for their application in [[time| temporal]] pattern recognition such as [[speech recognition|speech]], [[handwriting recognition|handwriting]], [[gesture recognition]], [[musical score]] following, [[partial discharge]]s and [[bioinformatics]].
== Architecture of a hidden Markov model ==
The diagram below shows the general architecture of an instantiated HMM. Each oval shape represents a random variable that can adopt a number of values. The random variable <math>x(t)</math> is the hidden state at time <math>t</math>. The random variable <math>y(t)</math> is the observation at time <math>t</math>. The arrows in the diagram (often called a [[Trellis (graph)|trellis diagram]]) denote conditional dependencies.
From the diagram, it is clear that the value of the hidden variable <math>x(t)</math> (at time <math>t</math>) ''only'' depends on the value of the hidden variable <math>x(t-1)</math>: the values at time <math>t-2</math> and before have no influence. This is called the [[Markov property]]. Similarly, the value of the observed variable <math>y(t)</math> only depends on the value of the hidden variable <math>x(t)</math> (both at time <math>t</math>).
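Taken together, the two dependency assumptions imply that the joint probability of a hidden sequence and an observed sequence factorizes into an initial term, one transition term per time step and one output term per time step. As a sketch, writing <math>x(t)</math> for the hidden state and <math>y(t)</math> for the observation at time <math>t</math>, and assuming a sequence of length <math>L</math> indexed from 0 (notation chosen here for illustration):
:<math>P\bigl(x(0),\dots,x(L-1),\,y(0),\dots,y(L-1)\bigr) = P\bigl(x(0)\bigr)\,\prod_{t=1}^{L-1} P\bigl(x(t)\mid x(t-1)\bigr)\,\prod_{t=0}^{L-1} P\bigl(y(t)\mid x(t)\bigr)</math>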
==Probability of an observed sequence==
The probability of observing a sequence <math>Y = y(0), y(1), \dots, y(L-1)</math> of length <math>L</math> is given by
:<math>P(Y) = \sum_{X} P(Y \mid X)\, P(X),</math>
where the sum runs over all possible hidden node sequences <math>X = x(0), x(1), \dots, x(L-1)</math>. Brute-force calculation of <math>P(Y)</math> is intractable for most real-life problems, as the number of possible hidden node sequences is typically extremely high. The calculation can, however, be sped up enormously using the [[forward-backward algorithm|forward algorithm]] or the equivalent backward algorithm.
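The difference between the brute-force sum and the forward recursion can be sketched in a few lines of Python. The two-state, three-symbol model below is purely hypothetical (the names <code>START</code>, <code>TRANS</code> and <code>EMIT</code> and all of their numbers are assumptions made for illustration), and the sketch assumes a non-empty observation sequence given as symbol indices.

```python
from itertools import product

# Hypothetical two-state, three-symbol model; every number is an illustrative assumption.
START = [0.6, 0.4]                          # P(x(0) = i)
TRANS = [[0.7, 0.3], [0.4, 0.6]]            # P(x(t) = j | x(t-1) = i)
EMIT = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]   # P(y(t) = k | x(t) = i)


def brute_force_probability(obs):
    """Sum P(Y | X) P(X) over every hidden sequence X (exponential in len(obs))."""
    total = 0.0
    for states in product(range(len(START)), repeat=len(obs)):
        p = START[states[0]] * EMIT[states[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= TRANS[states[t - 1]][states[t]] * EMIT[states[t]][obs[t]]
        total += p
    return total


def forward_probability(obs):
    """Forward recursion: the same sum, in time proportional to len(obs) * n**2."""
    n = len(START)
    # alpha[i] holds P(y(0), ..., y(t), x(t) = i) for the current t.
    alpha = [START[i] * EMIT[i][obs[0]] for i in range(n)]
    for y in obs[1:]:
        alpha = [sum(alpha[i] * TRANS[i][j] for i in range(n)) * EMIT[j][y]
                 for j in range(n)]
    return sum(alpha)


observed = [0, 1, 2]  # assumed non-empty sequence of symbol indices
print(brute_force_probability(observed), forward_probability(observed))
```

Both functions return the same value up to rounding; only the forward version remains practical as the sequence grows, because it reuses the partial sums instead of enumerating every hidden sequence.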
==Using hidden Markov models==
There are three [[canonical]] problems associated with HMMs:
* Given the parameters of the model, compute the probability of a particular output sequence, and the probabilities of the hidden state values given that output sequence. This problem is solved by the [[forward-backward algorithm]].
* Given the parameters of the model, find the most likely sequence of hidden states that could have generated a given output sequence. This problem is solved by the [[Viterbi algorithm]].
* Given an output sequence or a set of such sequences, find the most likely set of state transition and output probabilities. In other words, discover the parameters of the HMM given a dataset of sequences. This problem is solved by the [[Baum-Welch algorithm]].
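The second of these problems can be illustrated with a minimal Python sketch of the Viterbi recursion. The function name, the argument layout and the small model used in the usage lines are assumptions made for illustration, not part of any particular library; the sketch also assumes a non-empty observation sequence and works with plain probabilities rather than logarithms, which would underflow on long sequences.

```python
def viterbi(obs, start, trans, emit):
    """Return (most likely hidden state path, its joint probability with obs)."""
    n = len(start)
    # best[i]: probability of the most likely path that ends in state i at time t
    best = [start[i] * emit[i][obs[0]] for i in range(n)]
    back = []  # back[t-1][j]: best predecessor of state j at time t
    for y in obs[1:]:
        prev = best
        pointers = [max(range(n), key=lambda i: prev[i] * trans[i][j])
                    for j in range(n)]
        best = [prev[pointers[j]] * trans[pointers[j]][j] * emit[j][y]
                for j in range(n)]
        back.append(pointers)
    # Trace the stored predecessors backwards from the best final state.
    last = max(range(n), key=lambda i: best[i])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]


# Hypothetical model: two hidden states, three output symbols.
start = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
path, probability = viterbi([0, 1, 2], start, trans, emit)
print(path, probability)
```

The structure mirrors the forward recursion, with the sum over predecessor states replaced by a maximum and a record of which predecessor achieved it.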
=== A concrete example ===
''This example is further elaborated in the [[Viterbi algorithm]] page.''
===Applications of hidden Markov models===
* [[Cryptanalysis]]
* [[Speech recognition]]
* [[Machine translation]]
* [[Partial discharge]]
== History ==
Hidden Markov Models were first described in a series of statistical papers by [[Leonard E. Baum]] and other authors in the second half of the 1960s. One of the first applications of HMMs was [[speech recognition]], starting in the mid-1970s.
In the second half of the 1980s, HMMs began to be applied to the analysis of biological sequences, in particular [[DNA]]. Since then, they have become ubiquitous in the field of [[bioinformatics]].
HTML
'''HTML''', an [[Acronym and initialism|initialism]] of '''HyperText Markup Language''', is the predominant [[markup language]] for [[web page]]s. It provides a means to describe the structure of text-based information in a document — by denoting certain text as links, headings, paragraphs, lists, and so on — and to supplement that text with ''interactive forms'', embedded ''images'', and other objects. HTML is written in the form of tags, surrounded by [[Brackets#Angle brackets or chevrons .3C .3E|angle brackets]]. HTML can also describe, to some degree, the appearance and [[semantics]] of a document, and can include embedded [[scripting language]] code (such as JavaScript) which can affect the behavior of [[Web browser]]s and other HTML processors.
The term HTML is also often used to refer to content of the [[MIME type]] text/html, or even more broadly as a generic term for HTML, whether in its [[XML]]-descended form (such as [[XHTML]] 1.0 and later) or its form descended directly from [[SGML]] (such as HTML 4.01 and earlier).
By convention, HTML format data files use a file extension .html or .htm.
==History of HTML==
===Origins===
In 1980, physicist [[Tim Berners-Lee]], who was an independent contractor at [[CERN]], proposed and prototyped [[ENQUIRE]], a system for CERN researchers to use and share documents. In 1989, Berners-Lee and CERN data systems engineer [[Robert Cailliau]] each submitted separate proposals for an [[Internet]]-based [[hypertext]] system providing similar functionality. The following year, they collaborated on a joint proposal, the WorldWideWeb (W3) project, which was accepted by CERN.
===First specifications===
The first publicly available description of HTML was a document called ''HTML Tags'', first mentioned on the Internet by Berners-Lee in late 1991. It describes 22 elements comprising the initial, relatively simple design of HTML. Thirteen of these elements still exist in HTML 4.
Berners-Lee considered HTML to be, at the time, an application of [[SGML]], but it was not formally defined as such until the mid-1993 publication, by the [[Internet Engineering Task Force|IETF]], of the first proposal for an HTML specification: Berners-Lee and [[Dan Connolly]]'s "Hypertext Markup Language (HTML)" Internet-Draft, which included an SGML [[Document Type Definition]] to define the grammar. The draft expired after six months, but was notable for its acknowledgment of the [[Mosaic (web browser)|NCSA Mosaic]] browser's custom tag for embedding in-line images, reflecting the IETF's philosophy of basing standards on successful prototypes. Similarly, Dave Raggett's competing Internet-Draft, "HTML+ (Hypertext Markup Format)", from late 1993, suggested standardizing already-implemented features like tables and fill-out forms.
After the HTML and HTML+ drafts expired in early 1994, the IETF created an HTML Working Group, which in 1995 completed "HTML 2.0", the first HTML specification intended to be treated as a standard on which future implementations should be based. Published as [[Request for Comments]] 1866, HTML 2.0 included ideas from the HTML and HTML+ drafts. There was no "HTML 1.0"; the 2.0 designation was intended to distinguish the new edition from previous drafts.
Further development under the auspices of the IETF was stalled by competing interests. Since 1996, the HTML specifications have been maintained, with input from commercial software vendors, by the [[World Wide Web Consortium]] (W3C). In 2000, HTML also became an international standard ([[International Organization for Standardization|ISO]]/[[International Electrotechnical Commission|IEC]] 15445:2000). The last HTML specification published by the W3C is the HTML 4.01 Recommendation, published in late 1999. Its issues and errors were last acknowledged by errata published in 2001.
===Version history of the standard===
====HTML versions====
'''July, 1993:''' [http://www.w3.org/MarkUp/draft-ietf-iiir-html-01.txt Hypertext Markup Language] was published as an [[Internet Engineering Task Force|IETF]] working draft (that is, not yet a standard).
'''November, 1995:''' [http://tools.ietf.org/html/rfc1866 HTML 2.0] published as IETF [[Request for Comments]]:
* RFC 1866,
* supplemented by RFC 1867 (form-based file upload) that same month,
* RFC 1942 (tables) in ''May 1996'',
* RFC 1980 (client-side image maps) in ''August 1996'', and
* RFC 2070 ([[internationalization and localization|internationalization]]) in ''January 1997'';
Ultimately, all were declared obsolete/historic by RFC 2854 in ''June 2000''.
'''April 1995''': [http://www.w3.org/MarkUp/html3/ HTML 3.0], proposed as a standard to the IETF. It included many of the capabilities that were in Raggett's HTML+ proposal, such as support for tables, text flow around figures, and the display of complex mathematical formulas.
A demonstration appeared in W3C's own [[Arena (web browser)|Arena browser]]. HTML 3.0 did not succeed for several reasons. The pace of browser development, as well as the number of interested parties, had outstripped the resources of the IETF.
Netscape continued to introduce HTML elements that specified the visual appearance of documents, contrary to the goals of the newly-formed W3C, which sought to limit HTML to describing logical structure.
Microsoft, a newcomer at the time, played to all sides by creating its own tags, implementing Netscape's elements for compatibility, and supporting W3C features such as Cascading Style Sheets.
'''[[January 14]], [[1997]]:''' [http://www.w3.org/TR/REC-html32 HTML 3.2], published as a [[W3C Recommendation]]. It was the first version developed and standardized exclusively by the W3C, as the IETF had closed its HTML Working Group in September 1996.
The new version dropped math formulas entirely, reconciled overlap among various proprietary extensions, and adopted most of Netscape's visual markup tags. Netscape's [[blink element]] and Microsoft's [[marquee element]] were omitted due to a mutual agreement between the two companies. The ability to include mathematical formulas in HTML would not be standardized until years later in [[MathML]].
'''[[December 18]], [[1997]]:''' [http://www.w3.org/TR/REC-html40-971218/ HTML 4.0], published as a W3C Recommendation. It offers three "flavors":
* Strict, in which deprecated elements are forbidden,
* Transitional, in which deprecated elements are allowed,
* Frameset, in which mostly only [[Framing (World Wide Web)|frame]] related elements are allowed;
HTML 4.0 (initially code-named "Cougar") likewise adopted many browser-specific element types and attributes, but at the same time sought to phase out Netscape's visual markup features by marking them as [[deprecation|deprecated]] in favor of style sheets. Minor editorial revisions to the HTML 4.0 specification were published in 1998 without incrementing the version number, and further minor revisions were published as HTML 4.01.
'''[[April 24]], [[1998]]:''' [http://www.w3.org/TR/1998/REC-html40-19980424/ HTML 4.0] was reissued with minor edits without incrementing the version number.
'''[[December 24]], [[1999]]:''' [http://www.w3.org/TR/html401 HTML 4.01], published as a W3C Recommendation. It offers the same three flavors as HTML 4.0, and its last [http://www.w3.org/MarkUp/html4-updates/errata errata] were published [[May 12]], [[2001]].
HTML 4.01 and ISO/IEC 15445:2000 are the most recent and final versions of HTML.
'''[[May 15]], [[2000]]:''' [https://www.cs.tcd.ie/15445/15445.HTML ISO/IEC 15445:2000] ("[[International Organization for Standardization|ISO]] HTML", based on HTML 4.01 Strict), published as an ISO/IEC international standard.
'''[[January 22]], [[2008]]:''' [http://www.w3.org/TR/html5/ HTML 5], published as a Working Draft by W3C.
====XHTML versions====
XHTML is a separate language that began as a reformulation of HTML 4.01 using XML 1.0. It continues to be developed:
* [http://www.w3.org/TR/xhtml1/ XHTML 1.0], published [[January 26]], [[2000]] as a W3C Recommendation, later revised and republished [[August 1]], [[2002]]. It offers the same three flavors as HTML 4.0 and 4.01, reformulated in XML, with minor restrictions.
* [http://www.w3.org/TR/xhtml11/ XHTML 1.1], published [[May 31]], [[2001]] as a W3C Recommendation. It is based on XHTML 1.0 Strict, but includes minor changes, can be customized, and is reformulated using modules from [http://www.w3.org/TR/xhtml-modularization Modularization of XHTML], which was published [[April 10]], [[2001]] as a W3C Recommendation.
* [http://www.w3.org/TR/xhtml2/ XHTML 2.0] is still a W3C Working Draft. XHTML 2.0 is incompatible with XHTML 1.x and, therefore, would be more accurately characterized as an XHTML-inspired new language than as an update to XHTML 1.x.
* XHTML 5, which is an update to XHTML 1.x, is being defined alongside [[HTML 5]] in the [http://www.w3.org/html/wg/html5/ HTML 5 draft].
==HTML markup==
HTML markup consists of several key components, including ''elements'' (and their ''attributes''), character-based ''data types'', and ''character references'' and ''entity references''. Another important component is the ''document type declaration''.
HTML [[Hello world program|Hello World]]:
===Elements===
:''See [[HTML element]]s for more detailed descriptions.''
Elements are the basic structure for HTML markup. Elements have two basic properties: attributes and content. Each attribute and each element's content has certain restrictions that must be followed for an HTML document to be considered valid. An element usually has a start tag (e.g. <element-name>) and an end tag (e.g. </element-name>). The element's attributes are contained in the start tag and content is located between the tags (e.g. <element-name attribute="value">Content</element-name>). Some elements, such as <br>, do not have any content and must not have a closing tag. Listed below are several types of markup elements used in HTML.
'''Structural''' markup describes the purpose of text. For example,
:<h2>Golf</h2>
establishes "Golf" as a second-level [[heading]], which would be rendered in a browser in a manner similar to the "HTML markup" title at the start of this section. Structural markup does not denote any specific rendering, but most Web browsers have standardized on how elements should be formatted. Text may be further styled with [[Cascading Style Sheets]] (CSS).
'''Presentational''' markup describes the appearance of the text, regardless of its function. For example, <b>boldface</b> indicates that visual output devices should render "boldface" in bold text, but gives no indication of what devices that are unable to do this (such as aural devices that read the text aloud) should do. In the case of both <b>bold</b> and <i>italic</i>, there are elements which usually have an equivalent visual rendering but are more semantic in nature, namely <strong>strong emphasis</strong> and <em>emphasis</em> respectively. It is easier to see how an aural user agent should interpret the latter two elements. However, they are not equivalent to their presentational counterparts: it would be undesirable for a screen-reader to emphasize the name of a book, for instance, but on a screen such a name would be italicized. Most presentational markup elements have become [[Deprecation|deprecated]] under the HTML 4.0 specification, in favor of [[Cascading Style Sheets|CSS]] based style design.
'''Hypertext''' markup links parts of the document to other documents. HTML up through version [[XHTML]] 1.1 requires the use of an anchor element to create a hyperlink in the flow of text: <a>Wikipedia</a>. However, the href attribute must also be set to a valid [[Uniform Resource Locator|URL]]; for example, the HTML code <a href="http://en.wikipedia.org/">Wikipedia</a> will render the word "[http://en.wikipedia.org/ Wikipedia]" as a [[hyperlink]]. To make an image a hyperlink, an img element is placed inside the anchor element.
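How start tags, end tags, content and empty elements appear to an HTML processor can be sketched with the <code>html.parser</code> module from the Python standard library. The class name <code>ElementLogger</code>, the helper <code>parse_events</code> and the sample markup are hypothetical and chosen only for illustration; the parser is a lenient tokenizer and does not validate the document against a DTD.

```python
from html.parser import HTMLParser


class ElementLogger(HTMLParser):
    """Record start tags, end tags and text content in document order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        if data.strip():  # skip whitespace-only text
            self.events.append(("data", data.strip()))


def parse_events(markup):
    parser = ElementLogger()
    parser.feed(markup)
    parser.close()
    return parser.events


# Hypothetical sample: a heading, a hyperlink, and an empty br element.
sample = '<h2>Golf</h2><p>See <a href="http://en.wikipedia.org/">Wikipedia</a>.<br></p>'
for event in parse_events(sample):
    print(event)
```

In the resulting event list the h2, p and a elements each produce a start and an end event, while br produces only a start event because it has no content and no closing tag.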
===Attributes===
Most of the attributes of an element are name-value pairs, separated by "=", and written within the start tag of an element, after the element's name. The value may be enclosed in single or double quotes, although values consisting of certain characters can be left unquoted in HTML (but not XHTML). Leaving attribute values unquoted is considered unsafe. In contrast with name-value pair attributes, there are some attributes that affect the element simply by their presence in the start tag of the element (like the ismap attribute for the img element).
Most elements can take any of several common attributes:
* The id attribute provides a document-wide unique identifier for an element. This can be used by stylesheets to provide presentational properties, by browsers to focus attention on the specific element, or by scripts to alter the contents or presentation of an element.
* The class attribute provides a way of classifying similar elements for presentation purposes. For example, an HTML document might use the designation class="notation" to indicate that all elements with this class value are subordinate to the main text of the document. Such elements might be gathered together and presented as footnotes on a page instead of appearing in the place where they occur in the HTML source.
* An author may use the style attribute to assign presentational properties to a particular element. It is considered better practice to use an element's id or class attributes to select the element with a stylesheet, though sometimes this can be too cumbersome for a simple ad hoc application of styled properties.
* The title attribute is used to attach subtextual explanation to an element. In most browsers this attribute is displayed as what is often referred to as a [[tooltip]].
The generic inline element span can be used to demonstrate these various attributes:
::
This example displays as HTML; in most browsers, pointing the cursor at the abbreviation should display the title text "Hypertext Markup Language."
Most elements also take the language-related attributes lang and dir.
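The attribute forms described in this section can be inspected with a short Python sketch built on the standard <code>html.parser</code> module. The sample markup is hypothetical: the id, class and style values and the image file name are invented for illustration, and only the title text and the ismap example come from the discussion above.

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Map each element name to the attributes found in its start tag."""

    def __init__(self):
        super().__init__()
        self.found = {}

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None when the
        # attribute is present without a value.
        self.found[tag] = dict(attrs)


def collect_attributes(markup):
    collector = AttributeCollector()
    collector.feed(markup)
    collector.close()
    return collector.found


# Hypothetical markup: name-value attributes on span, a presence-only attribute on img.
sample = ('<span id="abbr-html" class="notation" style="color: blue" '
          'title="Hypertext Markup Language" lang="en" dir="ltr">HTML</span>'
          '<img src="map.png" ismap>')
attributes = collect_attributes(sample)
print(attributes["span"])
print(attributes["img"])
```

The span element yields six name-value pairs, whereas the ismap attribute of the img element is reported with no value, reflecting that it takes effect simply by being present.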
===Character and entity references===
As of version 4.0, HTML defines a set of [[List of XML and HTML character entity references|252]] [[character entity reference]]s and a set of 1,114,050 [[numeric character reference]]s, both of which allow individual characters to be written via simple markup, rather than literally. A literal character and its markup counterpart are considered equivalent and are rendered identically.
The ability to "escape" characters in this way allows for the characters < and & (when written as &lt; and &amp;, respectively) to be interpreted as character data, rather than markup. For example, a literal < normally indicates the start of a tag, and & normally indicates the start of a character entity reference or numeric character reference; writing it as &amp; or &#x26; or &#38; allows & to be included in the content of elements or the values of attributes. The double-quote character ("), when used to quote an attribute value, must also be escaped as &quot; or &#x22; or &#34; when it appears within the attribute value itself. The single-quote character ('), when used to quote an attribute value, must also be escaped as &#x27; or &#39; (should NOT be escaped as &apos; except in XHTML documents) when it appears within the attribute value itself. However, since document authors often overlook the need to escape these characters, browsers tend to be very forgiving, treating them as markup only when subsequent text appears to confirm that intent.
Escaping also allows for characters that are not easily typed or that aren't even available in the document's [[character encoding]] to be represented within the element and attribute content. For example, the acute-accented e (é), a character typically found only on Western European keyboards, can be written in any HTML document as the entity reference &eacute; or as the numeric references &#233; or &#xE9;. The characters comprising those references (that is, the &, the ;, the letters in eacute, and so on) are available on all keyboards and are supported in all character encodings, whereas the literal é is not.
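Escaping and unescaping can be tried out with the <code>html</code> module of the Python standard library, as in the following minimal sketch. The sample strings are hypothetical; <code>html.escape</code> and <code>html.unescape</code> are the library's own functions, and the particular reference chosen for each character (for example the hexadecimal form for the single quote) is that library's choice rather than a requirement of HTML.

```python
import html

# Escape markup-significant characters so they are treated as character data.
raw = 'AT&T says: <b> is "bold"'
escaped = html.escape(raw)  # also escapes quotes by default (quote=True)
print(escaped)

# A single quote is written as a numeric character reference, not as &apos;.
print(html.escape("it's"))

# An entity reference and both numeric forms decode to the same character.
decoded = html.unescape("&eacute; &#233; &#xE9;")
print(decoded)

# Escaping followed by unescaping returns the original text.
assert html.unescape(escaped) == raw
```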
===Data types===
HTML defines several [[data type]]s for element content, such as script data and stylesheet data, and a plethora of types for attribute values, including IDs, names, URIs, numbers, units of length, languages, media descriptors, colors, character encodings, dates and times, and so on. All of these data types are specializations of character data.
===The Document Type Declaration===
In order to enable [[Document Type Definition]] (DTD)-based validation with SGML tools and in order to avoid the [[quirks mode]] in browsers, HTML documents can start with a [[Document Type Declaration]] (informally, a "DOCTYPE"). The DTD to which the DOCTYPE refers contains machine-readable grammar specifying the permitted and prohibited content for a document conforming to such a DTD. Browsers do not necessarily read the DTD, however. The most popular graphical browsers use DOCTYPE declarations (or the lack thereof) and other data at the beginning of sources to determine which rendering mode to use.
For example:
:<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
This declaration references the Strict DTD of HTML 4.01, which does not have presentational elements like <font>, leaving formatting to Cascading Style Sheets and the span and div tags. SGML-based validators read the DTD in order to properly parse the document and to perform validation. In modern browsers, the HTML 4.01 Strict doctype activates standards layout mode for [[Cascading Style Sheets|CSS]] as opposed to [[quirks mode]].
In addition, HTML 4.01 provides Transitional and Frameset DTDs. The Transitional DTD was intended to gradually phase in the changes made in the Strict DTD, while the Frameset DTD was intended for those documents which contained frames.
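A processor can tell the three HTML 4.01 DTDs apart by the public identifier in the DOCTYPE. The Python sketch below reads the declaration with the standard <code>html.parser</code> module and looks the identifier up in a small table; the names <code>FLAVORS</code>, <code>DoctypeReader</code> and <code>flavor_of</code> are hypothetical, and the lookup is only an illustration, not a model of how any particular browser chooses its rendering mode.

```python
import re
from html.parser import HTMLParser

# Public identifiers of the three HTML 4.01 DTDs.
FLAVORS = {
    "-//W3C//DTD HTML 4.01//EN": "Strict",
    "-//W3C//DTD HTML 4.01 Transitional//EN": "Transitional",
    "-//W3C//DTD HTML 4.01 Frameset//EN": "Frameset",
}


class DoctypeReader(HTMLParser):
    """Capture the public identifier of the first DOCTYPE declaration seen."""

    def __init__(self):
        super().__init__()
        self.public_id = None

    def handle_decl(self, decl):
        # decl is the text between "<!" and ">".
        match = re.match(r'DOCTYPE\s+HTML\s+PUBLIC\s+"([^"]+)"', decl, re.IGNORECASE)
        if match and self.public_id is None:
            self.public_id = match.group(1)


def flavor_of(document):
    reader = DoctypeReader()
    reader.feed(document)
    reader.close()
    return FLAVORS.get(reader.public_id, "unknown or missing DOCTYPE")


strict_page = ('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
               '"http://www.w3.org/TR/html4/strict.dtd">'
               '<html><head><title>t</title></head><body><p>x</p></body></html>')
print(flavor_of(strict_page))
print(flavor_of("<html><body><p>no declaration</p></body></html>"))
```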
==Semantic HTML==
There is no official specification called "Semantic HTML", though the strict flavors of HTML discussed [[#Current flavors of HTML|below]] are a push in that direction. Rather, semantic HTML refers to an objective and a practice to create documents with HTML that contain only the author's intended meaning, without any reference to how this meaning is presented or conveyed. A classic example is the distinction between the emphasis element (<em>) and the italics element (<i>). Often the emphasis element is displayed in italics, so the presentation is typically the same. However, emphasizing something is different from listing the title of a book, for example, which may also be displayed in italics. In purely semantic HTML, a book title would use a different element than emphasized text uses (for example a <span>), because they are meaningfully different things.
The goal of semantic HTML requires two things of authors:
# To avoid the use of presentational markup (elements, attributes, and other entities).
# To use available markup to differentiate the meanings of phrases and structure in the document. So for example, the book title from above would need to have its own element and class specified, such as <cite class="booktitle">The Grapes of Wrath</cite>. Here, the <cite> element is used because it most closely matches the meaning of this phrase in the text. However, the <cite> element is not specific enough to this task, since we mean to cite specifically a book title as opposed to a newspaper article or an academic journal.
Semantic HTML also requires complementary specifications and software compliance with these specifications. Primarily, the development and proliferation of [[Cascading Style Sheets|CSS]] has led to increasing support for semantic HTML, because CSS provides designers with a rich language to alter the presentation of semantic-only documents. With the development of CSS, the need to include presentational properties in a document has virtually disappeared. With the advent and refinement of CSS and the increasing support for it in Web browsers, subsequent editions of HTML increasingly stress only using markup that suggests the semantic structure and phrasing of the document, like headings, paragraphs, quotes, and lists, instead of using markup which is written for visual purposes only, like <font>, <b> (bold), and <i> (italics). Some of these elements are not permitted in certain varieties of HTML, like HTML 4.01 Strict. CSS provides a way to separate document semantics from the content's presentation, by keeping everything relevant to presentation defined in a CSS file. See [[separation of style and content]].
Semantic HTML offers many advantages. First, it ensures consistency in style across elements that have the same meaning. Every heading, every quotation, every similar element receives the same presentation properties.
Second, semantic HTML frees authors from the need to concern themselves with presentation details. When writing the number two, for example, should it be written out in words ("two"), or should it be written as a numeral (2)? A semantic markup might simply mark the value 2 as a number and leave presentation details to the stylesheet designers. Similarly, an author might wonder where to break out quotations into separate indented blocks of text: with purely semantic HTML, such details would be left up to stylesheet designers. Authors would simply indicate quotations when they occur in the text, and not concern themselves with presentation.
A third advantage is device independence and repurposing of documents. A semantic HTML document can be paired with any number of stylesheets to provide output to computer screens (through Web browsers), high-resolution printers, handheld devices, aural browsers or braille devices for those with visual impairments, and so on. To accomplish this, nothing needs to be changed in a well-coded semantic HTML document. Readily available stylesheets make this a simple matter of pairing a semantic HTML document with the appropriate stylesheets. (Of course, the stylesheet's selectors need to match the appropriate properties in the HTML document.)
Some aspects of authoring documents make separating semantics from style (in other words, meaning from presentation) difficult. Some elements are hybrids, using presentation in their very meaning. For example, a table displays content in a tabular form. Often such content conveys the meaning only when presented in this way. Repurposing a table for an aural device typically involves somehow presenting the table, an inherently visual element, in an audible form. Conversely, song lyrics, which are inherently meant for audible presentation, are frequently presented in textual form on a Web page. For these types of elements, the meaning is not so easily separated from their presentation. However, for a great many of the elements used and meanings conveyed in HTML, the translation is relatively smooth.
==Delivery of HTML==
HTML documents can be delivered by the same means as any other computer file; however, they are most often delivered in one of two forms: over [[HTTP]] servers and through e-mail.
===Publishing HTML with HTTP===
The [[World Wide Web]] is composed primarily of HTML documents transmitted from a [[Web server]] to a Web browser using the [[Hypertext Transfer Protocol]] (HTTP). However, HTTP can be used to serve images, sound, and other content in addition to HTML. To allow the Web browser to know how to handle the document it received, an indication of the [[file format]] of the document must be transmitted along with the document. This vital [[metadata]] includes the [[MIME]] type (text/html for HTML 4.01 and earlier, application/xhtml+xml for XHTML 1.0 and later) and the character encoding (see [[Character encodings in HTML]]).
In modern browsers, the MIME type that is sent with the HTML document affects how the document is interpreted. A document sent with an XHTML MIME type, or ''served as application/xhtml+xml'', is expected to be [[XML#Well-formed documents|well-formed]] XML, and a syntax error causes the browser to fail to render the document. The same document sent with an HTML MIME type, or ''served as text/html'', might be displayed successfully, since Web browsers are more lenient with HTML. However, XHTML parsed in this way is considered neither proper XHTML nor proper HTML, but so-called [[tag soup]].
If the MIME type is not recognized as HTML, the Web browser should not attempt to render the document as HTML, even if the document is prefaced with a correct Document Type Declaration. Nevertheless, some Web browsers do examine the contents or URL of the document and attempt to infer the file type, despite this being forbidden by the HTTP 1.1 specification.
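The role of the MIME type and character encoding can be sketched by parsing a Content-Type header value in Python. The function name <code>describe_content_type</code> and the three mode labels are hypothetical, and the mapping simply restates the behaviour described above rather than the logic of any specific browser; the header parsing itself uses <code>email.message.Message</code> from the standard library.

```python
from email.message import Message


def describe_content_type(header_value):
    """Split a Content-Type header value into MIME type, charset and handling mode."""
    message = Message()
    message["Content-Type"] = header_value
    mime_type = message.get_content_type()     # lower-cased type/subtype
    charset = message.get_content_charset()    # lower-cased charset, or None
    if mime_type == "text/html":
        mode = "lenient HTML parsing"
    elif mime_type == "application/xhtml+xml":
        mode = "strict XML parsing"
    else:
        mode = "not treated as HTML"
    return mime_type, charset, mode


print(describe_content_type("text/html; charset=UTF-8"))
print(describe_content_type("application/xhtml+xml"))
print(describe_content_type("image/png"))
```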
===HTML e-mail===
Most graphical [[e-mail]] clients allow the use of a subset of HTML (often ill-defined) to provide formatting and [[semantic web|semantic]] markup capabilities not available with [[plain text]], like emphasized text, block quotations for replies, and diagrams or mathematical formulas that could not easily be described otherwise. Many of these clients include both a [[GUI]] editor for composing HTML e-mail messages and a rendering engine for displaying received HTML messages. Use of HTML in e-mail is controversial because of compatibility issues, because it can be used in [[phishing]]/privacy attacks, because it can confuse [[E-Mail spam|spam]] filters, and because the message size is larger than plain text.
===Naming conventions===
The most common [[filename extension]] for [[computer file|files]] containing HTML is .html. A common abbreviation of this is .htm; it originates from older operating systems and file systems, such as the [[DOS]] versions from the 1980s and early 1990s and [[File Allocation Table|FAT]], which limit file extensions to three letters. Both forms are widely supported by browsers.
==Current flavors of HTML==
Since its inception, HTML and its associated protocols gained acceptance relatively quickly. However, no clear standards existed in the early years of the language. Though its creators originally conceived of HTML as a semantic language devoid of presentation details, practical uses pushed many presentational elements and attributes into the language, driven largely by the various browser vendors. The latest standards surrounding HTML reflect efforts to overcome the sometimes chaotic development of the language and to create a rational foundation for building both meaningful and well-presented documents. To return HTML to its role as a semantic language, the [[World Wide Web Consortium|W3C]] has developed style languages such as [[Cascading Style Sheets|CSS]] and [[Extensible Stylesheet Language|XSL]] to shoulder the burden of presentation. In conjunction, the HTML specification has slowly reined in the presentational elements.
There are two axes differentiating various flavors of HTML as currently specified: SGML-based HTML versus XML-based HTML (referred to as XHTML) on the one axis, and strict versus transitional (loose) versus frameset on the other axis.
===SGML-based versus XML-based HTML===
One difference in the latest HTML specifications lies in the distinction between the SGML-based specification and the XML-based specification. The XML-based specification is usually called XHTML to distinguish it clearly from the more traditional definition; however, the root element name continues to be 'html' even in the XHTML-specified HTML. The W3C intended XHTML 1.0 to be identical to HTML 4.01 except where limitations of XML relative to the more complex SGML require workarounds. Because XHTML and HTML are closely related, they are sometimes documented in parallel. In such circumstances, some authors conflate the two names as (X)HTML or X(HTML).
Like HTML 4.01, XHTML 1.0 has three sub-specifications: strict, transitional (loose), and frameset.
Aside from the different opening declarations for a document, the differences between an HTML 4.01 and an XHTML 1.0 document (in each of the corresponding DTDs) are largely syntactic. The underlying syntax of HTML allows many shortcuts that XHTML does not, such as elements with optional opening or closing tags, and even EMPTY elements which must not have an end tag. By contrast, XHTML requires all elements to have both an opening tag and a closing tag. XHTML, however, also introduces a new shortcut: an XHTML tag may be opened and closed within the same tag, by including a slash before the end of the tag like this: <br/>. The introduction of this shorthand, which is not used in the SGML declaration for HTML 4.01, may confuse earlier software unfamiliar with this new convention.
To understand the subtle differences between HTML and XHTML, consider the transformation of a valid and well-formed XHTML 1.0 document that adheres to Appendix C (see below) into a valid HTML 4.01 document. To make this translation requires the following steps:
# '''The language for an element should be specified with a lang attribute rather than the XHTML xml:lang attribute.''' XHTML uses XML's built-in language-defining attribute.
# '''Remove the XML namespace (xmlns=URI).''' HTML has no facilities for namespaces.
# '''Change the document type declaration''' from XHTML 1.0 to HTML 4.01. (see [[#The Document Type Definition|DTD section]] for further explanation).
# If present, '''remove the XML declaration.''' (Typically this is: <?xml version="1.0" encoding="utf-8"?>).
# '''Ensure that the document’s MIME type is set to text/html.''' For both HTML and XHTML, this comes from the HTTP Content-Type header sent by the server.
# '''Change the XML empty-element syntax to an HTML style empty element''' (<br/> to <br>).
Those are the main changes necessary to translate a document from XHTML 1.0 to HTML 4.01. Translating from HTML to XHTML would also require the addition of any omitted opening or closing tags. Whether coding in HTML or XHTML, it may be simplest to always include the optional tags rather than to remember which tags can be omitted.
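The textual steps listed above can be sketched in a few lines of Python. This is a deliberately naive, regular-expression-based illustration, not a real converter: it assumes a small, well-formed document written in the Appendix C style, and the function name <code>xhtml_to_html</code> and the choice of the HTML 4.01 Strict document type are assumptions made for the example. The MIME type step is not shown, because it is set by the server rather than in the document.

```python
import re

# Assumed target: the HTML 4.01 Strict document type declaration.
HTML401_STRICT_DOCTYPE = (
    '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
    '"http://www.w3.org/TR/html4/strict.dtd">'
)


def _fix_lang(match: re.Match) -> str:
    """Keep or create a lang attribute and drop xml:lang within one tag."""
    tag = match.group(0)
    if re.search(r'\slang="', tag):
        # A lang attribute is already present: just remove xml:lang.
        return re.sub(r'\s+xml:lang="[^"]*"', "", tag)
    # Only xml:lang is present: rename it to lang.
    return tag.replace("xml:lang=", "lang=")


def xhtml_to_html(source: str) -> str:
    """Apply the textual conversion steps to a simple XHTML 1.0 string."""
    # Remove the XML declaration, if present.
    text = re.sub(r"^\s*<\?xml[^>]*\?>\s*", "", source)
    # Change the document type declaration.
    text = re.sub(r"<!DOCTYPE[^>]*>", HTML401_STRICT_DOCTYPE, text, count=1)
    # Remove the XML namespace.
    text = re.sub(r'\s+xmlns="[^"]*"', "", text)
    # Use lang rather than xml:lang on every start tag.
    text = re.sub(r"<[A-Za-z][^>]*>", _fix_lang, text)
    # Change the XML empty-element syntax (<br /> or <br/>) to <br>.
    text = re.sub(r"\s*/>", ">", text)
    return text


sample = (
    '<?xml version="1.0" encoding="utf-8"?>\n'
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n'
    '<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">\n'
    "<head><title>Example</title></head>\n"
    "<body><p>First line<br />second line</p></body>\n"
    "</html>\n"
)
converted = xhtml_to_html(sample)
```

A production tool would parse the document rather than rewrite it as text, since regular expressions cannot reliably handle comments, CDATA sections, or attribute values containing markup characters.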
A well-formed XHTML document adheres to all the syntax requirements of XML. A valid document adheres to the content specification for XHTML, which describes the document structure.
The W3C recommends several conventions to ensure an easy migration between HTML and XHTML (see [http://www.w3.org/TR/xhtml1/#guidelines HTML Compatibility Guidelines]). The following steps can be applied to XHTML 1.0 documents only:
* Include both xml:lang and lang attributes on any elements assigning language.
* Use the empty-element syntax only for elements specified as empty in HTML.
* Include an extra space in empty-element tags: for example <br /> instead of <br/>.
* Include explicit close tags for elements that permit content but are left empty (for example, <div></div>, not <div />).
* Omit the XML declaration.
When a document carefully follows the W3C’s compatibility guidelines, a user agent should be able to interpret it equally as HTML or XHTML. For documents that are XHTML 1.0 and have been made compatible in this way, the W3C permits them to be served either as HTML (with a text/html [[MIME type]]), or as XHTML (with an application/xhtml+xml or application/xml MIME type). When a document is delivered as XHTML, browsers should use an XML parser, which adheres strictly to the XML specifications for parsing the document's contents.
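One way a server might choose between the two MIME types for such a compatible document is to inspect the HTTP Accept header sent by the user agent. The sketch below is a simplified assumption of how that could work: the function name <code>choose_content_type</code> is hypothetical, and the logic ignores wildcards and the relative ranking of quality values, treating only <code>q=0</code> as a refusal.

```python
def choose_content_type(accept_header: str) -> str:
    """Pick a MIME type for an XHTML 1.0 document that follows the guidelines."""
    accepted = set()
    for item in accept_header.split(","):
        media_type, _, params = item.strip().partition(";")
        media_type = media_type.strip().lower()
        quality = 1.0  # default quality when no q parameter is given
        for param in params.split(";"):
            name, _, value = param.strip().partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0  # treat a malformed q value as "not accepted"
        if media_type and quality > 0:
            accepted.add(media_type)
    # Serve as XHTML only when the user agent explicitly lists the XHTML type.
    if "application/xhtml+xml" in accepted:
        return "application/xhtml+xml"
    # Otherwise fall back to serving the document as HTML.
    return "text/html"


xml_capable = choose_content_type(
    "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
)
html_only = choose_content_type("text/html, */*")
```

Falling back to text/html is the conservative choice here, because a user agent that does not advertise XHTML support may not use an XML parser at all.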
===Transitional versus Strict ===
The latest SGML-based specification HTML 4.01 and the earliest XHTML version include three sub-specifications: Strict, Transitional (once called Loose), and Frameset. The Strict variant represents the standard proper, whereas the Transitional and Frameset variants were developed to assist in the transition from earlier versions of HTML (including HTML 3.2). The Transitional and Frameset variants allow for [[presentational markup]] whereas the Strict variant encourages the use of style sheets through its omission of most presentational markup.
The primary differences which make the Transitional variant more permissive than the Strict variant (the differences are the same in HTML 4 and XHTML 1.0) are:
* '''A looser content model'''
** Inline elements and plain text (#PCDATA) are allowed directly in: body, blockquote, form, noscript and noframes
* '''Presentation related elements'''
** underline (u)
** strike-through (s, strike)
** center
** font
** basefont
* '''Presentation related attributes'''
** background and bgcolor attributes for body element.
** align attribute on div, form, paragraph (p), and heading (h1...h6) elements
** align, noshade, size, and width attributes on hr element
** align, border, vspace, and hspace attributes on img and object elements
** align attribute on legend and caption elements
** align and bgcolor on table element
** nowrap, bgcolor, width, height on td and th elements
** bgcolor attribute on tr element
** clear attribute on br element
** compact attribute on dl, dir and menu elements
** type, compact, and start attributes on ol and ul elements
** type and value attributes on li element
** width attribute on pre element
* '''Additional elements in Transitional specification'''
** menu list (no substitute, though unordered list is recommended; may return in XHTML 2.0 specification)
** dir list (no substitute, though unordered list is recommended)
** isindex (element requires server-side support and is typically added to documents server-side)
** applet (deprecated in favor of object element)
* '''The language attribute on script element''' (presumably redundant with type attribute, though this is maintained for legacy reasons).
* '''Frame related entities'''
** frameset element (used in place of body for frameset DTD)
** frame element
** iframe
** noframes
** target attribute on anchor, client-side image-map (imagemap), link, form, and base elements
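A rough check for some of these Transitional-only features can be sketched with Python's standard <code>html.parser</code> module. The element and attribute tables below are an illustrative subset of the list above (with <code>s</code> and <code>strike</code> as the strike-through elements), not a full DTD validation, and the names <code>TransitionalFeatureFinder</code> and <code>find_transitional_features</code> are hypothetical.

```python
from html.parser import HTMLParser

# Illustrative subset of elements absent from the Strict variant.
TRANSITIONAL_ONLY_ELEMENTS = {
    "u", "s", "strike", "center", "font", "basefont",
    "menu", "dir", "isindex", "applet", "iframe", "noframes",
}

# Illustrative subset of presentational attributes, keyed by element name.
TRANSITIONAL_ONLY_ATTRIBUTES = {
    "body": {"background", "bgcolor"},
    "br": {"clear"},
    "hr": {"align", "noshade", "size", "width"},
    "table": {"align", "bgcolor"},
    "tr": {"bgcolor"},
    "pre": {"width"},
    "script": {"language"},
}


class TransitionalFeatureFinder(HTMLParser):
    """Collect uses of elements and attributes from the tables above."""

    def __init__(self):
        super().__init__()
        self.findings = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lower case.
        if tag in TRANSITIONAL_ONLY_ELEMENTS:
            self.findings.append(f"element <{tag}>")
        flagged = TRANSITIONAL_ONLY_ATTRIBUTES.get(tag, set())
        for name, _value in attrs:
            if name in flagged:
                self.findings.append(f"attribute {name} on <{tag}>")


def find_transitional_features(markup: str) -> list:
    """Return the Transitional-only features found, in document order."""
    finder = TransitionalFeatureFinder()
    finder.feed(markup)
    finder.close()
    return finder.findings


report = find_transitional_features(
    '<body bgcolor="#ffffff"><center><font size="2">Hi</font></center>'
    "<p>Plain <em>strict</em> text</p><hr noshade></body>"
)
```

An empty result from such a check does not show that a document is valid Strict markup; it only shows that none of the listed features were found.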
===Frameset versus transitional===
In addition to the above transitional differences, the frameset specifications (whether XHTML 1.0 or HTML 4.01) specify a different content model, in which the frameset element is used in place of body.
=== Summary of flavors ===
As this list demonstrates, the loose flavors of the specification are maintained for legacy support. However, contrary to popular misconceptions, the move to XHTML does not imply a removal of this legacy support. Rather the X in XML stands for extensible and the W3C is modularizing the entire specification and opening it up to independent extensions. The primary achievement in the move from XHTML 1.0 to XHTML 1.1 is the modularization of the entire specification. The strict version of HTML is deployed in XHTML 1.1 through a set of modular extensions to the base XHTML 1.1 specification. Likewise someone looking for the loose (transitional) or frameset specifications will find similar extended XHTML 1.1 support (much of it is contained in the legacy or frame modules). The modularization also allows for separate features to develop on their own timetable. So for example XHTML 1.1 will allow quicker migration to emerging XML standards such as [[MathML]] (a presentational and semantic math language based on XML) and [[XForms]] — a new highly advanced web-form technology to replace the existing HTML forms.
In summary, the HTML 4.01 specification primarily reined in all the various HTML implementations into a single clear written specification based on SGML. XHTML 1.0 ported this specification, as is, to the new XML-defined specification. Next, XHTML 1.1 takes advantage of the extensible nature of XML and modularizes the whole specification. XHTML 2.0 will be the first step in adding new features to the specification in a standards-body-based approach.
== Hypertext features not in HTML ==
HTML lacks some of the features found in earlier hypertext systems, such as [[typed link]]s, [[transclusion]], [[source tracking]], [[fat link]]s, and more. Even some hypertext features that were in early versions of HTML have been ignored by most popular web browsers until recently, such as the [[Hyperlink|link]] element and in-browser Web page editing.
Sometimes Web services or browser manufacturers remedy these shortcomings. For instance, [[wiki]]s and [[content management system]]s allow surfers to edit the Web pages they visit.
IBM
'''International Business Machines Corporation''', abbreviated '''IBM''' and nicknamed '''"Big Blue"''', is a [[multinational corporation|multinational]] [[computer]] [[technology]] and [[consulting]] [[corporation]] headquartered in [[Armonk, New York]], [[United States of America|USA]]. The company is one of the few [[information technology]] companies with a continuous history dating back to the 19th century. IBM manufactures and sells computer [[computer hardware|hardware]] and [[computer software|software]], and offers infrastructure services, [[Internet hosting service|hosting services]], and [[consultant|consulting services]] in areas ranging from [[mainframe computer]]s to [[nanotechnology]].
IBM has been known through most of its recent history as the world's largest computer company; with over 388,000 employees worldwide, IBM is the largest [[information technology]] employer in the world. Despite falling behind [[Hewlett-Packard]] in total revenue since 2006, it remains the most profitable.
IBM holds more [[patent]]s than any other U.S.-based technology company. It has engineers and consultants in over 170 countries, and [[IBM Research]] has eight laboratories worldwide. IBM employees have earned three [[Nobel Prize]]s, four [[Turing Award]]s, five [[National Medal of Technology|National Medals of Technology]], and five [[National Medal of Science|National Medals of Science]]. As a chip maker, IBM has been among the [[Worldwide Top 20 Semiconductor Sales Leaders]] in past years, and in 2007 IBM ranked second in the list of largest software companies in the world.
==History==
The company which became IBM was founded in 1896 as the Tabulating Machine Company by [[Herman Hollerith]], in [[Broome County, New York]] ([[Endicott, New York]], where it still maintains very limited operations). It was incorporated as [[Computing Tabulating Recording Corporation (CTR)]] on [[June 16]], [[1911]], and was listed on the [[New York Stock Exchange]] in 1916. IBM adopted its current name in 1924, when it became a [[Fortune 500]] company.
In the 1950s, IBM became the dominant vendor in the emerging [[computer]] industry with the release of the [[IBM 701]] and other models in the [[IBM 700/7000 series]] of [[mainframes]]. The company's dominance became even more pronounced in the 1960s and 1970s with the [[IBM System/360]] and [[IBM System/370]] mainframes. However, antitrust actions by the [[United States Department of Justice]], the rise of [[minicomputer]] companies like [[Digital Equipment Corporation]] and [[Data General]], and the introduction of the [[microprocessor]] all contributed to the dilution of IBM's position in the industry, eventually leading the company to diversify into other areas including personal computers, software, and services.
In 1981 IBM introduced the [[IBM Personal Computer]] which is the original version and progenitor of the [[IBM PC compatible]] hardware [[platform (computing)|platform]]. Descendants of the IBM PC compatibles make up the majority of [[microcomputer]]s on the market today. IBM sold its PC division to the Chinese company [[Lenovo]] on [[May 1]], [[2005]] for $655 million in cash and $600 million in Lenovo stock.
On [[January 25]], [[2007]], [[Ricoh]] announced the purchase of IBM's Printing Systems Division for $725 million and an investment in a three-year joint venture to form a new Ricoh subsidiary, [[InfoPrint Solutions Company]]; Ricoh will own a 51% share, and IBM will own a 49% share in ''InfoPrint''.
===Controversies===
The author [[Edwin Black]] has alleged that, during [[World War II]], IBM CEO [[Thomas J. Watson]] used overseas subsidiaries to provide the [[Third Reich]] with [[Unit record equipment|unit record]] [[data processing]] machines, supplies and services that helped the [[Nazis]] to efficiently track down European Jews, with sizable profits for the company. IBM denies that they had control over these subsidiaries after the Nazis took power. A lawsuit against IBM based on these allegations was dismissed.
In support of the Allied war effort in World War II, from 1943 to 1945 IBM produced approximately 346,500 M1 Carbine (Caliber .30 carbine) light rifles for the U.S. Military.
==Current projects==
===Eclipse===
Eclipse is a platform-independent, [[Java (programming language)|Java]]-based [[software framework]]. Eclipse was originally a [[Proprietary software|proprietary]] product developed by IBM as a successor of the [[VisualAge]] family of tools. Eclipse has subsequently been released as [[free software|free]]/[[open source]] software under the [[Eclipse Public License]].
===developerWorks===
developerWorks is a website run by [[IBM]] for [[software developer]]s and IT professionals. It contains a large number of how-to articles and tutorials, as well as software downloads and code samples, discussion forums, podcasts, blogs, wikis, and other resources for developers and technical professionals. Subjects range from open, industry-standard technologies like [[Java (programming language)|Java]], [[Linux]], [[Service-oriented architecture|SOA]] and [[web services]], [[web development]], [[Ajax (programming)|Ajax]], [[PHP]], and [[XML]] to IBM's products ([[WebSphere]], [[Rational Software|Rational]], [[Lotus Software|Lotus]], [[Tivoli Systems, Inc.|Tivoli]] and [[IBM DB2|DB2]]). In 2007 developerWorks was inducted into the Jolt Hall of Fame.
===alphaWorks===
alphaWorks is IBM's source for emerging software technologies. These technologies include:
*'''Flexible Internet Evaluation Report Architecture''' - A highly flexible architecture for the design, display, and reporting of Internet surveys.
*'''[[IBM History Flow tool|IBM History Flow Visualization Application]]''' - A tool for visualizing dynamic, evolving documents and the interactions of multiple collaborating authors.
*'''IBM [[Linux]] on POWER Performance Simulator''' - A tool that provides users of Linux on Power a set of performance models for IBM's POWER processors.
*'''Database File Archive And Restoration Management''' - An application for archiving and restoring hard disk files using file references stored in a database.
*'''Policy Management for Autonomic Computing''' - A policy-based autonomic management infrastructure that simplifies the automation of IT and business processes.
*'''FairUCE''' - A spam filter that verifies sender identity instead of filtering content.
*'''Unstructured Information Management Architecture (UIMA) SDK''' - A Java SDK that supports the implementation, composition, and deployment of applications working with unstructured information.
*'''Accessibility Browser''' - A web-browser specifically designed to assist people with visual impairments, to be released as open-source software. Also known as the "A-Browser," the technology will aim to eliminate the need for a mouse, relying instead completely on voice-controls, buttons and predefined shortcut keys.
===Semiconductor design and manufacturing===
Virtually all modern [[video game console|console gaming systems]] use [[IC design|microprocessors developed]] by IBM. The [[Xbox 360]] contains the [[Xenon (processor)|Xenon]] tri-core processor, which was designed and produced by IBM in less than 24 months. Sony's [[PlayStation 3]] features the [[Cell microprocessor| Cell BE microprocessor]] designed jointly by IBM, [[Toshiba]], and [[Sony]]. [[Nintendo]]'s [[History of video game consoles (seventh generation)|seventh-generation]] console, [[Wii]], features an IBM chip codenamed [[Broadway (microprocessor)|Broadway]]. The older [[Nintendo GameCube]] also utilizes the [[Gekko (microprocessor)|Gekko]] processor, designed by IBM.
In May 2002, IBM and Butterfly.net, Inc. announced the Butterfly Grid, a commercial [[grid computing|grid]] for the online video gaming market. In March 2006, IBM announced separate agreements with Hoplon Infotainment, Online Game Services Incorporated (OGSI), and RenderRocket to provide on-demand content management and [[blade server]] computing resources.
===Open Client Offering===
IBM announced that it will launch new software called "Open Client Offering", which is to run on [[Microsoft]]'s [[Microsoft Windows|Windows]], [[Linux]] and [[Apple Inc.|Apple]]'s [[Macintosh]]. The company states that the product allows businesses to offer employees a choice of using the same software on Windows and its alternatives. "Open Client Offering" is thus intended to cut the cost of managing Linux or Apple systems relative to Windows. Companies will not need to pay Microsoft for licenses for these operations, since the operations will no longer rely on Windows-based software. One alternative to Microsoft's office software is software based on the Open Document Format, whose development IBM supports. It is to be used for tasks such as word processing and presentations, along with collaboration through [[Lotus Notes]], instant messaging and blog tools, as well as the [[Firefox]] web browser, a competitor to [[Internet Explorer]]. IBM plans to install Open Client on 5 percent of its desktop PCs.
===UC2: Unified Communications and Collaboration===
'''UC2''' (''Unified Communications and Collaboration'') is an IBM and [[Cisco]] joint project based on [[Eclipse (software)|Eclipse]] and [[OSGi]]. It will offer the numerous Eclipse application developers a unified platform for an easier work environment.
Software based on the UC2 platform will provide major enterprises with easy-to-use communication solutions, such as the Lotus-based [[Sametime]]. In the future, Sametime users will benefit from additional functions such as [[click-to-call]] and [[Voicemail|voice mailing]].
===Internal programs===
[[Extreme Blue]] is a company initiative that uses experienced IBM engineers, talented interns, and business managers to develop high-value technology. The project is designed to analyze emerging business needs and the technologies that can solve them. These projects mostly involve rapid-prototyping of high-profile software and hardware projects.
In May 2007, IBM unveiled [[Project Big Green]] -- a re-direction of $1 billion per year across its businesses to increase energy efficiency.
==IBM Software Group==
This group is one of the major divisions of IBM. The various brands include:
* [[IBM Information Management Software|Information Management Software]] — database servers and tools, text analytics, content management, business process management and business intelligence.
* [[Lotus Software]] — Groupware, collaboration and business software. Acquired in 1995.
* [[Rational Software]] — Software development and application lifecycle management. Acquired in 2002.
* [[Tivoli Software]] — Systems management. Acquired in 1996.
* [[IBM WebSphere|WebSphere]] — Integration and application infrastructure software.
==Environmental record==
IBM has a long history of dealing with its environmental problems. It established a corporate policy on environmental protection in 1971, with the support of a comprehensive global environmental management system. According to IBM’s figures, its total hazardous waste decreased by 44 percent over the past five years, and has decreased by 94.6 percent since 1987. IBM's total hazardous waste calculation consists of waste from both non-manufacturing and manufacturing operations. Waste from manufacturing operations includes waste recycled in closed-loop systems, where process chemicals are recovered for subsequent reuse rather than being disposed of and replaced with new chemical materials. Over the years, IBM has redesigned processes to eliminate almost all closed-loop recycling and now uses more environmentally friendly materials in their place.
IBM was recognized as one of the "Top 20 Best Workplaces for Commuters" by the U.S. Environmental Protection Agency ([[EPA]]) in 2005. This was to recognize the Fortune 500 companies that provided their employees with excellent commuter benefits that helped reduce traffic and air pollution.
However, the birthplace of IBM, [[Endicott, New York|Endicott]], suffered IBM's pollution for decades. IBM used liquid cleaning agents in its circuit board assembly operation for more than two decades, and six spill and leak incidents were recorded, including one 1979 leak of 4,100 gallons from an underground tank. These left behind volatile organic compounds in the town's soil and aquifer. Trace amounts of volatile organic compounds have been identified in Endicott’s drinking water, but the levels are within regulatory limits. Also, since 1980, IBM has pumped out 78,000 gallons of chemicals, including trichloroethane, Freon, benzene and perchloroethene, into the air and allegedly caused several cancer cases among the villagers. IBM Endicott has been identified by the Department of Environmental Conservation as the major source of pollution, though traces of contaminants from a local dry cleaner and other polluters were also found. Despite the amount of pollutant, state health officials cannot say whether air or water pollution in Endicott has actually caused any health problems. Village officials say tests show that the water is safe to drink.
=== Solar power ===
Tokyo Ohka Kogyo Co., Ltd. (TOK) and IBM are collaborating to establish new, low-cost methods for bringing the next generation of solar energy products to market, namely [[CIGS]] (Copper-Indium-Gallium-Selenide) [[solar cell]] modules. The use of [[thin film]] technology, such as CIGS, has great promise in reducing the overall cost of solar cells and further enabling their widespread adoption.
IBM is exploring four main areas of photovoltaic research: using current technologies to develop cheaper and more efficient [[silicon]] [[solar cell]]s, developing new solution processed [[thin film]] photovoltaic devices, [[concentrator photovoltaics]], and future generation photovoltaic architectures based upon [[nanostructures]] such as [[semiconductor quantum dot]]s and [[nanowire]]s.
Dr. Supratik Guha is the leading scientist in IBM photovoltaics.
==Corporate culture of IBM==
'''Big Blue''' is a nickname for IBM; several theories exist regarding its origin. One theory, substantiated by people who worked for IBM at the time, is that IBM field reps coined the term in the 1960s, referring to the color of the mainframes IBM installed in the 1960s and early 1970s. "All blue" was a term used to describe a loyal IBM customer, and business writers later picked up the term. Another theory suggests that Big Blue simply refers to the Company's [[logo]]. A third theory suggests that Big Blue refers to a former company dress code that required many IBM employees to wear only white shirts and many wore blue suits. In any event, IBM keyboards, typewriters, and some other manufactured devices, have played on the "Big Blue" concept, using the color for enter keys and carriage returns.
===Sales===
IBM has often been described as having a sales-centric or a sales-oriented business culture. Traditionally, many IBM executives and general managers are chosen from the sales force. The current CEO, [[Sam Palmisano]], for example, joined the company as a salesman and, unusually for CEOs of major corporations, has no MBA or postgraduate qualification. Middle and top management are often enlisted to give direct support to salesmen when pitching sales to important customers.
===The uniform===
A dark (or gray) suit, white shirt, and a "sincere" tie was the public uniform for IBM employees for most of the 20th Century. During IBM's management transformation in the 1990s, CEO [[Lou Gerstner]] relaxed these codes, normalizing the dress and behavior of IBM employees to resemble their counterparts in other large technology companies.
===IBM company values and "Jam"===
In 2003, IBM embarked on an ambitious project to rewrite company values. Using its ''Jam'' technology, the company hosted Intranet-based online discussions on key business issues with 50,000 employees over 3 days. The discussions were analyzed by sophisticated text analysis software (eClassifier) to mine online comments for themes. As a result of the 2003 Jam, the company values were updated to reflect three modern business, marketplace and employee views: "Dedication to every client's success", "Innovation that matters - for our company and for the world", "Trust and personal responsibility in all relationships".
In 2004, another Jam was conducted during which 52,000 employees exchanged best practices for 72 hours. They focused on finding actionable ideas to support implementation of the values previously identified. A new post-Jam Ratings event was developed to allow IBMers to select key ideas that support the values. The board of directors cited this Jam when awarding Palmisano a pay rise in the spring of 2005.
In July and September 2006, Palmisano launched another jam called [https://www.globalinnovationjam.com/ InnovationJam]. InnovationJam was the largest online brainstorming session ever, with more than 150,000 participants from 104 countries. The participants were IBM employees, members of IBM employees' families, universities, partners, and customers. InnovationJam was divided into two sessions (one in July and one in September) of 72 hours each and generated more than 46,000 ideas. In November 2006, IBM declared that it would invest US$100 million in the 10 best ideas from InnovationJam.
===Open source===
IBM has been influenced by the [[Open Source Initiative]], and began supporting [[Linux]] in 1998. The company invests billions of dollars in services and software based on [[Linux]] through the IBM [[Linux Technology Center]], which includes over 300 [[Linux kernel]] developers. IBM has also released code under different [[open-source license]]s, such as the platform-independent software framework [[Eclipse (software)|Eclipse]] (worth approximately US$40 million at the time of the donation) and the [[Java (programming language)|Java]]-based [[relational database management system]] (RDBMS) [[Apache Derby]]. IBM's open source involvement has not been trouble-free, however (see ''[[SCO v. IBM]]'').
== Corporate affairs ==
=== Diversity and workforce issues ===
IBM's efforts to promote workforce diversity and equal opportunity date back at least to [[World War I]], when the company hired disabled veterans. IBM was the only technology company ranked in ''Working Mother'' magazine's Top 10 for 2004, and one of two technology companies in 2005 (the other company being Hewlett-Packard).
On [[September 21]], [[1953]], [[Thomas J. Watson]], the CEO at the time, sent out a very controversial letter to all IBM employees stating that IBM needed to hire the best people, regardless of their race, ethnic origin, or gender. He stated that this would give IBM a competitive advantage because IBM would then be able to hire talented people its competitors would turn down. In 1984, IBM added sexual preference to this policy.
The company has traditionally resisted [[trade union|labor union]] organizing, although unions represent some IBM workers outside the United States.
In the 1990s, two major [[pension]] program changes, including a conversion to a cash balance plan, resulted in an employee [[class action]] lawsuit alleging [[age discrimination]]. IBM employees won the lawsuit and arrived at a partial settlement, although appeals are still underway. IBM also settled a major overtime class-action lawsuit in 2006.
Historically, IBM has had a good reputation for long-term staff retention, with few large-scale layoffs. In more recent years there have been a number of broad, sweeping cuts to the workforce as IBM attempts to adapt to changing market conditions and a declining profit base. After posting weaker-than-expected revenues in the first quarter of 2005, IBM eliminated 14,500 positions from its workforce, predominantly in Europe. In May 2005, IBM Ireland told staff that the Microelectronics Division (MD) facility would close by the end of 2005 and offered staff a settlement. However, all staff who wished to stay with the company were redeployed within IBM Ireland. Production moved to Amkor, a company that purchased IBM's Microelectronics business in Singapore; it is widely agreed that IBM promised Amkor full load capacity in return for the purchase of the facility. On [[June 8]] [[2005]], IBM Canada Ltd. eliminated approximately 700 positions. IBM presents these cuts as part of a strategy to "rebalance" its portfolio of professional skills and businesses. [[IBM India]] and other IBM offices in [[China]], the [[Philippines]] and [[Costa Rica]] have seen a recruitment boom and steady growth in employee numbers due to lower wages.
On [[October 10]] [[2005]], IBM became the first major company in the world to formally commit to not using [[genetic testing|genetic information]] in its employment decisions. This came just a few months after IBM announced its support of the [[National Geographic Society]]'s [[The Genographic Project|Genographic Project]].
==== Gay rights ====
IBM provides employees' same-sex partners with benefits and provides an anti-discrimination clause. The [[Human Rights Campaign]] has consistently rated IBM 100% on its index of gay-friendliness since 2003 (in 2002, the year it began compiling its report on major companies, IBM scored 86%).
===Logos===
[[Logo]]s designed in the 1970s tended to be sensitive to the technical limitations of photocopiers, which were then being widely deployed. A logo with large solid areas tended to be poorly copied by copiers in the 1970s, so companies preferred logos that avoided large solid areas. The 1972 IBM logos are an example of this tendency. With the advent of digital copiers in the mid-1980s this technical restriction had largely disappeared; at roughly the same time, the 13-bar logo was abandoned for almost the opposite reason: it was difficult to render accurately on the low-resolution digital printers (240 dots per inch) of the time.
===Board of directors===
Current members of the [[board of directors]] of IBM are:
*Cathleen Black, President, [[Hearst Corporation|Hearst Magazines]]
*[[William Brody]], President, [[Johns Hopkins University]]
*[[Ken Chenault]], Chairman and CEO, [[American Express]] Company
*Juergen Dormann, Chairman of the Board, ABB Ltd
*[[Michael Eskew]], Chairman and CEO, [[United Parcel Service]], Inc.
*[[Shirley Ann Jackson]], President, [[Rensselaer Polytechnic Institute]]
*Minoru Makihara, Senior Corporate Advisor and former Chairman, [[Mitsubishi Corporation]]
*Lucio Noto, Managing Partner, Midstream Partners LLC
*[[James W. Owens]], Chairman and CEO, [[Caterpillar Inc.]]
*[[Samuel J. Palmisano]], Chairman, President and CEO, IBM
*Joan Spero, President, [[Doris Duke]] Charitable Foundation
*Sidney Taurell, Chairman and CEO, [[Eli Lilly and Company]]
*[[Lorenzo Zambrano]], Chairman and CEO, [[Cemex]] SAB de CV
Information
'''Information''' as a [[Conveyed concept|concept]] has a diversity of meanings, from everyday usage to technical settings. Generally speaking, the concept of information is closely related to notions of [[constraint]], [[communication]], [[control system|control]], [[data]], [[form]], [[instruction]], [[knowledge]], [[Meaning (linguistics)|meaning]], [[stimulation|mental stimulus]], [[pattern]], [[perception]], and [[knowledge representation|representation]].
Many people speak about the [[Information Age]] as the advent of the Knowledge Age or [[knowledge society]], the [[information society]], the [[Information revolution]], and [[Information technology|information technologies]], and even though [[informatics]], [[information science]] and [[computer science]] are often in the spotlight, the word "information" is often used without careful consideration of the various meanings it has acquired.
== Etymology ==
According to the [[Oxford English Dictionary]], the earliest historical meaning of the word ''information'' in [[English language|English]] was the act of ''informing'', or giving form or shape to the mind, as in education, instruction, or training. A quote from 1387: "Five books come down from heaven for information of mankind." It was also used for an ''item'' of training, ''e.g.'' a particular instruction. "Melibee had heard the great skills and reasons of Dame Prudence, and her wise information and techniques." (1386)
The English word was apparently derived by adding the common "noun of action" ending "''-ation''" (descended through French from Latin "''-tio''") to the earlier verb ''to inform'', in the sense of to give form to the mind, to discipline, instruct, teach: "Men so wise should go and inform their kings." (1330) ''Inform'' itself comes (via French) from the Latin verb ''informare'', to give form to, to form an idea of. Furthermore, Latin already contained the word ''informatio'', meaning concept or idea, but the extent to which this may have influenced the development of the word ''information'' in English is unclear.
As a final note, the ancient Greek word for ''form'' was [[eidos]], and this word was famously used in a technical philosophical sense by [[Plato]] (and later Aristotle) to denote the ideal identity or essence of something (see [[Theory of forms]]). "Eidos" can also be associated with [[thought]], [[proposition]] or even [[concept]].
== Information as a message ==
'''Information''' is the state of a system of interest. A message is that information materialized.
Information is a quality of a [[message]] from a [[sender]] to one or more receivers. Information is always ''about'' something (size of a parameter, occurrence of an event, etc). Viewed in this manner, information does not have to be accurate. It may be a truth or a lie, or just the sound of a falling tree. Even a disruptive noise used to inhibit the flow of communication and create misunderstanding would in this view be a form of information. However, generally speaking, if the ''amount'' of information in the received message increases, the message is more accurate.
This model assumes there is a definite [[sender]] and at least one receiver. Many refinements of the model assume the existence of a common language understood by the sender and at least one of the receivers. An important variation identifies information as that which would be communicated by a message if it were sent from a sender to a receiver capable of understanding the message. Notably, it is not required that the sender be capable of understanding the message, or even cognizant that there is a message. Thus, information is something that can be extracted from an environment, e.g., through observation, reading or measurement.
Information is a term with many meanings depending on context, but is as a rule closely related to such concepts as meaning, knowledge, instruction, communication, representation, and mental stimulus. Simply stated, information is a message received and understood. In terms of data, it can be defined as a collection of facts from which conclusions may be drawn. There are many other aspects of information since it is the knowledge acquired through study or experience or instruction. But overall, information is the result of processing, manipulating and organizing data in a way that adds to the knowledge of the person receiving it. [[Communication theory]] provides a numerical measure of the uncertainty of an outcome. For example, we can say that "the signal contained thousands of bits of information". Communication theory tends to use the concept of [[information entropy]], generally attributed to [[C.E. Shannon]] (see below).
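The numerical measure mentioned above can be illustrated with a minimal sketch. It assumes a source whose outcomes have known probabilities; the function names <code>self_information_bits</code> and <code>entropy_bits</code> are illustrative choices, not taken from any particular library.

```python
import math

def self_information_bits(n_messages):
    """Bits produced by choosing one of n equally likely messages."""
    return math.log2(n_messages)

def entropy_bits(probabilities):
    """Average uncertainty, in bits, of an outcome with the given probabilities."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Eight equally likely messages: both measures agree.
uniform_bits = self_information_bits(8)
uniform_entropy = entropy_bits([1 / 8] * 8)

# A biased source is less uncertain than a fair one.
fair_coin = entropy_bits([0.5, 0.5])
biased_coin = entropy_bits([0.9, 0.1])
print(uniform_bits, uniform_entropy, fair_coin, biased_coin)
```

For equally likely outcomes the entropy reduces to the base two logarithm of the number of outcomes; a skewed distribution gives a smaller value because its outcome is easier to predict.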
Another form of information is [[Fisher information]], a concept of [[R.A. Fisher]]. This is used in application of statistics to [[estimation theory]] and to science in general. Fisher information is thought of as the amount of information that a message carries about an unobservable parameter. It can be computed from knowledge of the [[likelihood function]] defining the system. For example, with a normal likelihood function, the Fisher information is the reciprocal of the variance of the law. In the absence of knowledge of the likelihood law, the Fisher information may be computed from normally distributed score data as the reciprocal of their second moment.
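The statement about the normal case can be checked numerically. The following is a rough sketch under stated assumptions: the data are simulated with a known mean and standard deviation (the values 1.0 and 2.0 are arbitrary), and the helper names are hypothetical. It estimates the Fisher information about the mean in two ways, as the reciprocal of the sample variance and as the second moment of the likelihood score.

```python
import random
import statistics

def fisher_from_variance(samples):
    """Reciprocal of the sample variance of normally distributed data."""
    return 1 / statistics.pvariance(samples)

def fisher_from_score(samples, mu, sigma):
    """Second moment of the score d/d(mu) log f(x) = (x - mu) / sigma**2."""
    scores = [(x - mu) / sigma**2 for x in samples]
    return sum(s * s for s in scores) / len(scores)

rng = random.Random(0)  # fixed seed so the run is repeatable
mu, sigma = 1.0, 2.0    # assumed parameters for the simulation
samples = [rng.gauss(mu, sigma) for _ in range(200_000)]

by_variance = fisher_from_variance(samples)
by_score = fisher_from_score(samples, mu, sigma)
print(by_variance, by_score, 1 / sigma**2)
```

Both estimates should lie close to the reciprocal of the variance, here 0.25, with small differences due to sampling.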
Even though the terms information and data are often used interchangeably, they are actually very different. Data is a set of unrelated facts, and as such is of no use until it is properly evaluated. Upon evaluation, once some significant relation is found among the data and they show some relevance, they are converted into information. The same data can then be used for different purposes. Thus, until the data convey some information, they are not useful.
=== Measuring information entropy ===
The view of information as a message came into prominence with the publication in 1948 of an influential paper by [[Claude Shannon]], "[[A Mathematical Theory of Communication]]." This paper provides the foundations of [[information theory]] and endows the word ''information'' not only with a technical meaning but also a measure. If the sending device is equally likely to send any one of a set of <math>N</math> messages, then the preferred measure of "the information produced when one message is chosen from the set" is the base two [[logarithm]] of <math>N</math> (this measure is called ''[[self-information]]''). In this paper, Shannon continues:
A complementary way of measuring information is provided by [[algorithmic information theory]]. In brief, this measures the information content of a list of symbols based on how predictable they are, or more specifically how easy it is to compute the list through a [[computer program|program]]: the information content of a sequence is the number of bits of the shortest program that computes it. The sequence below would have a very low algorithmic information measurement since it is a very predictable pattern, and as the pattern continues the measurement would not change. Shannon information would give the same information measurement for each symbol, since they are [[statistical randomness|statistically random]], and each new symbol would increase the measurement.
:123456789101112131415161718192021
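The low algorithmic information content of this sequence can be made concrete with a short program that produces it. This is only an illustrative sketch; the function name <code>counting_sequence</code> is hypothetical, and the length of a Python program is used loosely as a stand-in for the length of the shortest program.

```python
def counting_sequence(limit):
    """Concatenate the decimal numerals 1, 2, ..., limit."""
    return "".join(str(i) for i in range(1, limit + 1))

short_output = counting_sequence(21)
long_output = counting_sequence(1000)

# The program text stays essentially the same size while its output grows.
print(short_output)
print(len(short_output), len(long_output))
```

Extending the sequence only requires changing the argument, so the description length grows far more slowly than the output itself.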
It is important to recognize the limitations of traditional information theory and algorithmic information theory from the perspective of human meaning. For example, when referring to the meaning content of a message Shannon noted “Frequently the messages have ''meaning…'' these semantic aspects of communication are irrelevant to the engineering problem. The significant aspect is that the actual message is one selected ''from a set of possible messages''” (emphasis in original).
In information theory signals are part of a process, not a substance; they do something, they do not contain any specific meaning. Combining algorithmic information theory and information theory we can conclude that the most random signal contains the most information as it can be interpreted in any way and cannot be compressed.
Michael Reddy noted that "'signals' of the [[mathematical theory]] are 'patterns that can be exchanged'. There is no message contained in the signal, the signals convey the ability to select from a set of possible messages." In information theory "the system must be designed to operate for each possible selection, not just the one which will actually be chosen since this is unknown at the time of design".
== Information as a pattern ==
Information is any represented [[pattern]]. This view assumes neither accuracy nor directly communicating parties, but instead assumes a separation between an object and its representation. Consider the following example: [[economic statistics]] represent an [[Economics|economy]], however inaccurately. What are commonly referred to as data in [[computing]], [[statistics]], and other fields, are forms of information in this sense. The [[electromagnetism|electro-magnetic]] patterns in a [[computer network]] and connected [[peripheral device|device]]s are related to something other than the pattern itself, such as [[Character (computing)|text characters]] to be displayed and [[Computer keyboard|keyboard]] input. [[Signal (information theory)|Signal]]s, [[Sign (linguistics)|sign]]s, and [[symbol]]s are also in this category. On the other hand, according to [[semiotics]], data is symbols with certain syntax and information is data with a certain semantic. [[Painting]] and [[drawing]] contain information to the extent that they represent something such as an assortment of objects on a table, a [[profile]], or a [[landscape]]. In other words, when a pattern of something is transposed to a pattern of something else, the latter is information. This would be the case whether or not there was anyone to perceive it.
But if information can be defined merely as a pattern, does that mean that neither [[utility]] nor meaning are necessary components of information? Arguably a distinction must be made between raw unprocessed data and information which possesses utility, [[value (economics)|value]] or some quantum of meaning. On this view, information may indeed be characterized as a pattern; but this is a [[necessary]] condition, not a [[sufficient]] one.
An individual entry in a telephone book, which follows a specific pattern formed by name, address and telephone number, does not become "informative" in some sense unless and until it possesses some degree of utility, value or meaning. For example, someone might look up a girlfriend's number or order a takeaway. The vast majority of numbers will never be construed as "information" in any meaningful sense. The gap between data and information is only closed by a behavioral bridge whereby some value, utility or meaning is added to transform mere data or pattern into information.
When one constructs a representation of an object, one can selectively extract from the object ([[sampling (case studies)|sampling]]) or use a [[system]] of signs to replace ([[encode|encoding]]), or both. The sampling and encoding result in representation. An example of the former is a "sample" of a product; an example of the latter is "verbal description" of a product. Both contain information of the product, however inaccurate. When one interprets representation, one can predict a broader pattern from a limited number of observations (inference) or understand the relation between patterns of two different things ([[decode|decoding]]). One example of the former is to sip a [[soup]] to know if it is spoiled; an example of the latter is examining footprints to determine the animal and its condition. In both cases, information sources are not constructed or presented by some "sender" of information.
Regardless, information is dependent upon, but usually unrelated to and separate from, the medium or media used to express it. In other words, the position of a theoretical series of bits, or even the output once interpreted by a [[computer]] or similar device, is unimportant, except when someone or something is present to interpret the information. Therefore, a quantity of information is totally distinct from its medium.
== Information as sensory input ==
Often information is viewed as a type of [[input]] to an [[organism]] or designed device. Inputs are of two kinds. Some inputs are important to the function of the organism (for example, food) or device ([[energy]]) by themselves. In his book ''Sensory Ecology,'' Dusenbery called these causal inputs. Other inputs (information) are important only because they are associated with causal inputs and can be used to predict the occurrence of a causal input at a later time (and perhaps another place). Some information is important because of association with other information but eventually there must be a connection to a causal input. In practice, information is usually carried by weak stimuli that must be detected by specialized sensory systems and amplified by energy inputs before they can be functional to the organism or device. For example, light is often a causal input to plants but provides information to animals. The colored light reflected from a flower is too weak to do much photosynthetic work but the visual system of the bee detects it and the bee's nervous system uses the information to guide the bee to the flower, where the bee often finds nectar or pollen, which are causal inputs, serving a nutritional function.
Information is any type of sensory input. When an organism with a [[nervous system]] receives an input, it transforms the input into an electrical signal. This is regarded as information by some. The idea of representation is still relevant, but in a slightly different manner. That is, while [[abstract painting]] does not represent anything concretely, when the viewer sees the painting, it is nevertheless transformed into electrical signals that create a representation of the painting. Defined this way, information does not have to be related to truth, communication, or representation of an object. [[Entertainment]] in general is not intended to be informative. [[Music]], the [[performing arts]], [[amusement park]]s, works of [[fiction]] and so on are thus forms of information in this sense, but they are not necessarily forms of information according to some definitions given above. Consider another example: food supplies both nutrition and taste for those who eat it. If information is equated to sensory input, then nutrition is not information but taste is.
== Information as an influence which leads to a transformation ==
Information is any type of pattern that influences the formation or transformation of other patterns. In this sense, there is no need for a conscious mind to perceive, much less appreciate, the pattern. Consider, for example, [[DNA]]. The sequence of [[nucleotide]]s is a pattern that influences the formation and development of an organism without any need for a conscious mind. [[Systems theory]] at times seems to refer to information in this sense, assuming information does not necessarily involve any conscious mind, and patterns circulating (due to [[feedback]]) in the system can be called information. In other words, it can be said that information in this sense is something potentially perceived as representation, though not created or presented for that purpose.
When [[Marshall McLuhan]] speaks of [[media (communication)|media]] and their effects on human cultures, he refers to the structure of [[cultural artifact|artifacts]] that in turn shape our behaviors and mindsets. Also, [[pheromone]]s are often said to be "information" in this sense.
(See also [[Gregory Bateson]].)
== Information as a property in physics ==
In 2003, J. D. Bekenstein claimed there is a growing trend in [[physics]] to define the physical world as being made of information itself (and thus information is defined in this way). Information has a well-defined meaning in physics. Examples of this include the phenomenon of [[quantum entanglement]], where particles can remain correlated without reference to their separation or the speed of light. Information itself, however, cannot travel faster than light, even if it is transmitted indirectly. This could mean that all attempts at physically observing a particle with an "entangled" relationship to another are slowed down, even though the particles are not connected in any way other than by the information they carry.
Another link is demonstrated by the [[Maxwell's demon]] thought experiment. In this experiment, a direct relationship between information and another physical property, [[entropy]], is demonstrated. A consequence is that it is impossible to destroy information without increasing the entropy of a system; in practical terms this often means generating heat. Another, more philosophical, outcome is that information could be thought of as interchangeable with [[Energy#Transformations_of_energy|energy]]. Thus, in the study of [[logic gates]], the theoretical lower bound of thermal energy released by an ''AND gate'' is higher than for the ''NOT gate'' (because information is destroyed in an ''AND gate'' and simply converted in a ''NOT gate''). Physical information is of particular importance in the theory of [[quantum computers]].
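The comparison between the two gates can be worked through numerically. The sketch below rests on two assumptions that are not stated in the text above: the gate inputs are uniformly random, and the lower bound is taken from Landauer's principle, under which erasing one bit dissipates at least <math>k_B T \ln 2</math> of heat. The function names and the temperature of 300 K are illustrative.

```python
import math
from itertools import product

BOLTZMANN = 1.380649e-23  # Boltzmann constant in J/K (exact SI value)

def entropy_bits(probabilities):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

def bits_erased(gate, n_inputs):
    """Input entropy minus output entropy, assuming uniformly random inputs."""
    inputs = list(product([0, 1], repeat=n_inputs))
    counts = {}
    for bits in inputs:
        out = gate(*bits)
        counts[out] = counts.get(out, 0) + 1
    output_entropy = entropy_bits([c / len(inputs) for c in counts.values()])
    return n_inputs - output_entropy

def landauer_bound_joules(bits, temperature_kelvin):
    """Minimum heat released when erasing the given number of bits."""
    return bits * BOLTZMANN * temperature_kelvin * math.log(2)

and_loss = bits_erased(lambda a, b: a & b, 2)  # two input bits, one output bit
not_loss = bits_erased(lambda a: 1 - a, 1)     # one-to-one, nothing erased

and_heat = landauer_bound_joules(and_loss, 300)
not_heat = landauer_bound_joules(not_loss, 300)
print(and_loss, not_loss, and_heat, not_heat)
```

Under these assumptions the ''AND gate'' erases a little over one bit per operation, while the ''NOT gate'' erases none, so only the former has a non-zero lower bound.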
== Information as records ==
Records are a specialized form of information. Essentially, records are information produced consciously or as by-products of business activities or transactions and retained because of their value. Primarily their value is as evidence of the activities of the organization but they may also be retained for their informational value. Sound [[records management]] ensures that the integrity of records is preserved for as long as they are required.
The international standard on records management, ISO 15489, defines records as "information created, received, and maintained as evidence and information by an organization or person, in pursuance of legal obligations or in the transaction of business". The International Council on Archives (ICA) Committee on Electronic Records defined a record as "a specific piece of recorded information generated, collected or received in the initiation, conduct or completion of an activity and that comprises sufficient content, context and structure to provide proof or evidence of that activity".
Records may be retained because of their business value, as part of the [[corporate memory]] of the organization or to meet legal, fiscal or accountability requirements imposed on the organization. Willis (2005) expressed the view that sound management of business records and information delivered "…six key requirements for good [[corporate governance]]…transparency; accountability; due process; compliance; meeting statutory and common law requirements; and security of personal and corporate information."
== Information and semiotics ==
Beynon-Davies explains the multi-faceted concept of information in terms of that of signs and sign-systems. Signs themselves can be considered in terms of four inter-dependent levels, layers or branches of [[semiotics]]: pragmatics, semantics, syntactics and empirics. These four layers serve to connect the social world on the one hand with the physical or technical world on the other.
[[Pragmatics]] is concerned with the purpose of communication. Pragmatics links the issue of signs with that of intention. The focus of pragmatics is on the intentions of human agents underlying communicative behaviour. In other words, intentions link language to action. [[Semantics]] is concerned with the meaning of a message conveyed in a communicative act. Semantics considers the content of communication. Semantics is the study of the meaning of signs - the association between signs and behaviour. Semantics can be considered as the study of the link between symbols and their referents or concepts; particularly the way in which signs relate to human behaviour.
Syntactics is concerned with the formalism used to represent a message. Syntactics as an area studies the form of communication in terms of the logic and grammar of sign systems. Syntactics is devoted to the study of the form rather than the content of signs and sign-systems.
Empirics is the study of the signals used to carry a message; the physical characteristics of the medium of communication. Empirics is devoted to the study of communication channels and their characteristics, e.g., sound, light, electronic transmission etc.
Communication normally exists within the context of some social situation. The social situation sets the context for the intentions conveyed (pragmatics) and the form in which communication takes place. In a communicative situation intentions are expressed through messages which comprise collections of inter-related signs taken from a language which is mutually understood by the agents involved in the communication. Mutual understanding implies that agents involved understand the chosen language in terms of its agreed syntax (syntactics) and semantics. The sender codes the message in the language and sends the message as signals along some communication channel (empirics). The chosen communication channel will have inherent properties which determine outcomes such as the speed with which communication can take place and over what distance.
Information extraction
In [[natural language processing]], '''information extraction''' (IE) is a type of [[information retrieval]] whose goal is to automatically extract structured information, i.e. categorized and contextually and semantically well-defined data from a certain domain, from unstructured [[machine-readable]] documents. An example of information extraction is the extraction of instances of corporate mergers from an online news sentence such as: "Yesterday, New-York based Foo Inc. announced their acquisition of Bar Corp." A broad goal of IE is to allow computation to be done on the previously unstructured data. A more specific goal is to allow logical reasoning to draw inferences based on the logical content of the input data.
The significance of IE is determined by the growing amount of information available in unstructured (i.e. without [[metadata]]) form, for instance on the Internet. This knowledge can be made more accessible by means of transformation into [[relational database|relational form]], or by marking-up with [[XML]] tags. An intelligent agent monitoring a news data feed requires IE to transform unstructured data into something that can be reasoned with.
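The transformation into relational form can be sketched for the merger sentence quoted earlier. This is a deliberately naive illustration, not a description of how production IE systems work: the regular expression, the table name <code>merger</code> and the function names are all hypothetical, and the pattern only recognises company names ending in "Inc.", "Corp." or "Ltd.".

```python
import re
import sqlite3

# Hypothetical pattern: a capitalised word followed by a company suffix.
COMPANY = r"[A-Z]\w*(?: (?:Inc|Corp|Ltd)\.)"
MERGER = re.compile(
    rf"(?P<acquirer>{COMPANY}) announced (?:their|its) acquisition of (?P<acquired>{COMPANY})"
)

def extract_mergers(sentence):
    """Return (acquirer, acquired) pairs found in one sentence."""
    return [(m.group("acquirer"), m.group("acquired")) for m in MERGER.finditer(sentence)]

def store_mergers(connection, pairs):
    """Insert the extracted pairs into a relational table."""
    connection.execute("CREATE TABLE IF NOT EXISTS merger (acquirer TEXT, acquired TEXT)")
    connection.executemany("INSERT INTO merger VALUES (?, ?)", pairs)

sentence = "Yesterday, New-York based Foo Inc. announced their acquisition of Bar Corp."
connection = sqlite3.connect(":memory:")  # throwaway in-memory database
store_mergers(connection, extract_mergers(sentence))
rows = connection.execute("SELECT acquirer, acquired FROM merger").fetchall()
print(rows)
```

Once the facts are stored as rows, they can be queried and joined like any other relational data, which is the sense in which the unstructured text has been made accessible to computation.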
A typical application of IE is to scan a set of documents written in a [[natural language]] and populate a database with the information extracted. Current approaches to IE use [[natural language processing]] techniques that focus on very restricted domains. For example, the ''[[Message Understanding Conference]]'' (MUC) is a competition-based conference that focused on the following domains in the past:
*MUC-1 (1987), MUC-2 (1989): Naval operations messages.
*MUC-3 (1991), MUC-4 (1992): Terrorism in Latin American countries.
*MUC-5 (1993): Joint ventures and microelectronics domain.
*MUC-6 (1995): News articles on management changes.
*MUC-7 (1998): Satellite launch reports.
Natural language texts may first need some form of [[text simplification]] to produce more easily machine-readable text from which sentences can be extracted.
Typical subtasks of IE are:
* [[Named Entity Recognition]]: recognition of entity names (for people and organizations), place names, temporal expressions, and certain types of numerical expressions.
* [[Coreference]]: identification of chains of [[noun phrase]]s that refer to the same object. For example, [[Anaphora (linguistics)|anaphora]] is a type of coreference.
* [[Terminology extraction]]: finding the relevant terms for a given [[text corpus|corpus]]
* Relation Extraction: identification of relations between entities, such as:
**PERSON works for ORGANIZATION (extracted from the sentence "Bill works for IBM.")
**PERSON located in LOCATION (extracted from the sentence "Bill is in France.")
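The relation extraction subtask can be sketched with hand-written patterns for the two example sentences in the list above. Practical systems generally combine named entity recognition with learned or much richer rules; the relation labels <code>works_for</code> and <code>located_in</code> and the patterns below are assumptions made only for illustration.

```python
import re

# Hypothetical surface patterns, one per relation type.
PATTERNS = [
    ("works_for", re.compile(r"(?P<subject>[A-Z]\w+) works for (?P<object>[A-Z]\w+)")),
    ("located_in", re.compile(r"(?P<subject>[A-Z]\w+) is in (?P<object>[A-Z]\w+)")),
]

def extract_relations(text):
    """Return (subject, relation, object) triples matched in the text."""
    relations = []
    for name, pattern in PATTERNS:
        for match in pattern.finditer(text):
            relations.append((match.group("subject"), name, match.group("object")))
    return relations

triples = extract_relations("Bill works for IBM. Bill is in France.")
print(triples)
```

Each triple pairs two entity mentions with a relation label, which is the structured form that the subtasks listed above aim to produce.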
Information retrieval
'''Information retrieval''' ('''IR''') is the science of searching for documents, for [[information]] within documents and for [[Metadata (computing)|metadata]] about documents, as well as that of searching [[relational database]]s and the [[World Wide Web]]. There is overlap in the usage of the terms data retrieval, [[document retrieval]], information retrieval, and [[text retrieval]], but each also has its own body of literature, theory, [[Praxis (process)|praxis]] and technologies. IR is [[interdisciplinary]], based on [[computer science]], [[mathematics]], [[library science]], [[information science]], [[information architecture]], [[cognitive psychology]], [[linguistics]], [[statistics]] and [[physics]].
Automated information retrieval systems are used to reduce what has been called "[[information overload]]". Many universities and [[public library|public libraries]] use IR systems to provide access to books, journals and other documents. Web [[Web search engine|search engine]]s are the most visible [[Information retrieval applications|IR applications]].
== History ==
The idea of using computers to search for relevant pieces of information was popularized in an article ''[[As We May Think]]'' by [[Vannevar Bush]] in 1945. First implementations of information retrieval systems were introduced in the 1950s and 1960s. By 1990 several different techniques had been shown to perform well on small text corpora (several thousand documents).
In 1992 the US Department of Defense, along with the [[National Institute of Standards and Technology]] (NIST), cosponsored the [[Text Retrieval Conference]] (TREC) as part of the TIPSTER text program. The aim was to support the information retrieval community by supplying the infrastructure needed to evaluate text retrieval methodologies on a very large text collection. This catalyzed research on methods that [[scalability|scale]] to huge corpora. The introduction of web [[Web search engine|search engine]]s has boosted the need for very large scale retrieval systems even further.
The use of digital methods for storing and retrieving information has led to the phenomenon of [[digital obsolescence]], where a digital resource ceases to be readable because the physical media, the reader required to read the media, the hardware, or the software that runs on it, is no longer available. The information is initially easier to retrieve than if it were on paper, but is then effectively lost.
=== Timeline ===
* 1890: Hollerith tabulating machines were used to analyze the US census. ([[Herman Hollerith]]).
* 1945: [[Vannevar Bush]]'s ''[[As We May Think]]'' appeared in ''[[Atlantic Monthly]]''
* Late 1940s: The US military confronted problems of indexing and retrieval of wartime scientific research documents captured from Germans.
* 1947: [[Hans Peter Luhn]] (research engineer at IBM since 1941) began work on a mechanized, punch card based system for searching chemical compounds.
* 1950: The term "information retrieval" may have been coined by [[Calvin Mooers]].
* 1950s: Growing concern in the US for a "science gap" with the USSR motivated, encouraged funding, and provided a backdrop for mechanized literature searching systems ([[Allen Kent]] et al) and the invention of citation indexing ([[Eugene Garfield]]).
* 1955: Allen Kent joined [[Case Western Reserve University]], and eventually became associate director of the Center for Documentation and Communications Research. That same year, Kent and colleagues published a paper in ''American Documentation'' describing the precision and recall measures, as well as detailing a proposed "framework" for evaluating an IR system, which included statistical sampling methods for determining the number of relevant documents not retrieved.
* 1958: The International Conference on Scientific Information, held in Washington, DC, included consideration of IR systems as a solution to the problems identified. See: Proceedings of the International Conference on Scientific Information, 1958 (National Academy of Sciences, Washington, DC, 1959)
* 1959: Hans Peter Luhn published "Auto-encoding of documents for information retrieval."
* 1960: Melvin Earl (Bill) Maron and J. L. Kuhns published "On relevance, probabilistic indexing, and information retrieval" in Journal of the ACM 7(3):216-244, July 1960.
* Early 1960s: [[Gerard Salton]] began work on IR at Harvard, later moved to Cornell.
* 1962: [[Cyril W. Cleverdon]] published early findings of the Cranfield studies, developing a model for IR system evaluation. See: Cyril W. Cleverdon, "Report on the Testing and Analysis of an Investigation into the Comparative Efficiency of Indexing Systems". Cranfield Coll. of Aeronautics, Cranfield, England, 1962.
* 1962: Kent published Information Analysis and Retrieval
* 1963: Weinberg report "Science, Government and Information" gave a full articulation of the idea of a "crisis of scientific information." The report was named after Dr. [[Alvin Weinberg]].
* 1963: [[Joseph Becker]] and [[Robert M. Hayes]] published text on information retrieval. Becker, Joseph; Hayes, Robert Mayo. Information storage and retrieval: tools, elements, theories. New York, Wiley (1963).
* 1964: [[Karen Spärck Jones]] finished her thesis at Cambridge, ''Synonymy and Semantic Classification'', and continued work on [[computational linguistics]] as it applies to IR
* 1964: The [[National Bureau of Standards]] sponsored a symposium titled "Statistical Association Methods for Mechanized Documentation." Several highly significant papers were presented, including what is believed to be G. Salton's first published reference to the SMART system.
* Mid-1960s: National Library of Medicine developed [[MEDLARS]] (Medical Literature Analysis and Retrieval System), the first major machine-readable database and batch retrieval system
* Mid-1960s: Project Intrex at MIT
* 1965: [[J. C. R. Licklider]] published ''Libraries of the Future''
* 1966: [[Don Swanson]] was involved in studies at University of Chicago on Requirements for Future Catalogs
* 1968: Gerard Salton published ''Automatic Information Organization and Retrieval''.
* 1968: [[J. W. Sammon]]'s RADC Tech report "Some Mathematics of Information Storage and Retrieval..." outlined the vector model.
* 1969: Sammon's "A nonlinear mapping for data structure analysis" (IEEE Transactions on Computers) was the first proposal for visualization interface to an IR system.
* Late 1960s: [[F. W. Lancaster]] completed evaluation studies of the MEDLARS system and published the first edition of his text on information retrieval
* Early 1970s: First online systems appeared, including NLM's AIM-TWX and MEDLINE, Lockheed's Dialog, and SDC's ORBIT.
* Early 1970s: [[Theodor Nelson]] promoted the concept of [[hypertext]] and published ''Computer Lib/Dream Machines''.
* 1971: [[N. Jardine]] and [[C. J. Van Rijsbergen]] published "The use of hierarchic clustering in information retrieval", which articulated the "cluster hypothesis." (Information Storage and Retrieval, 7(5), pp. 217-240, Dec 1971)
*1975: Three highly influential publications by Salton fully articulated his vector processing framework and term discrimination model:
** A Theory of Indexing (Society for Industrial and Applied Mathematics)
** "A theory of term importance in automatic text analysis", (JASIS v. 26)
** "A vector space model for automatic indexing", (CACM 18:11)
* 1978: The First [[Association for Computing Machinery|ACM]] [[SIGIR]] conference.
* 1979: C. J. Van Rijsbergen published ''Information Retrieval'' (Butterworths). Heavy emphasis on probabilistic models.
* 1980: First international ACM SIGIR conference, joint with British Computer Society IR group in Cambridge
* 1982: [[Nicholas J. Belkin|Belkin]], Oddy, and Brooks proposed the ASK (Anomalous State of Knowledge) viewpoint for information retrieval. This was an important concept, though their automated analysis tool proved ultimately disappointing.
* 1983: Salton (and M. McGill) published Introduction to Modern Information Retrieval (McGraw-Hill), with heavy emphasis on vector models.
* Mid-1980s: Efforts to develop end user versions of commercial IR systems.
* 1985-1993: Key papers on and experimental systems for visualization interfaces.
** Work by [[D. B. Crouch]], [[Robert R. Korfhage]], [[M. Chalmers]], [[A. Spoerri]] and others.
* 1989: First [[World Wide Web]] proposals by [[Tim Berners-Lee]] at [[CERN]].
* 1992: First TREC conference.
* 1997: Publication of [[Robert R. Korfhage|Korfhage]]'s ''Information Storage and Retrieval'' with emphasis on visualization and multi-reference point systems.
* Late 1990s: Web [[Web search engine|search engine]] implementation of many features formerly found only in experimental IR systems
== Overview ==
An information retrieval process begins when a user enters a query into the system. Queries are formal statements of [[information need]]s, for example search strings in web search engines. In information retrieval a query does not uniquely identify a single object in the collection. Instead, several objects may match the query, perhaps with different degrees of [[relevance|relevancy]].
An object is an entity which keeps or stores information in a database. User queries are matched to objects stored in the database. Depending on the [[Information retrieval applications|application]] the data objects may be, for example, text documents, images or videos. Often the documents themselves are not kept or stored directly in the IR system, but are instead represented in the system by document surrogates.
Most IR systems compute a numeric score for how well each object in the database matches the query, and rank the objects according to this value. The top-ranking objects are then shown to the user. The process may then be iterated if the user wishes to refine the query.
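As a rough illustration of the score-and-rank step described above, the following sketch scores each document by the cosine similarity between raw term-count vectors and returns the top-ranked objects. The scoring function, the whitespace tokenizer, and the toy `collection` are assumptions made for this example; real IR systems typically use more elaborate weighting and indexing.

```python
import math
from collections import Counter


def cosine_score(query_terms, doc_terms):
    """Cosine similarity between raw term-count vectors (an illustrative score)."""
    q, d = Counter(query_terms), Counter(doc_terms)
    dot = sum(q[t] * d[t] for t in q)
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in d.values())))
    return dot / norm if norm else 0.0


def rank(query, collection, top_k=3):
    """Return (doc_id, score) pairs sorted by descending score."""
    q_terms = query.lower().split()
    scored = [(doc_id, cosine_score(q_terms, text.lower().split()))
              for doc_id, text in collection.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical three-document collection, used only for this example.
collection = {
    "d1": "information retrieval ranks documents",
    "d2": "entropy measures uncertainty",
    "d3": "retrieval of information from text documents and images",
}

for doc_id, score in rank("information retrieval", collection):
    print(doc_id, round(score, 3))
```

Several documents match the query with different scores, which is the sense in which a query does not uniquely identify a single object in the collection.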
== Performance measures ==
Many different measures for evaluating the performance of information retrieval systems have been proposed. The measures require a collection of documents and a query. All common measures described here assume a ground truth notion of relevancy: every document is known to be either relevant or non-relevant to a particular query. In practice queries may be [[ill-posed]] and there may be different shades of relevancy.
=== Precision ===
Precision is the fraction of the documents retrieved that are [[Relevance (information retrieval)|relevant]] to the user's information need.
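:<math>\mbox{precision}=\frac{|\{\mbox{relevant documents}\}\cap\{\mbox{retrieved documents}\}|}{|\{\mbox{retrieved documents}\}|}</math>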
In [[binary classification]], precision is analogous to [[positive predictive value]]. Precision takes all retrieved documents into account. It can also be evaluated at a given cut-off rank, considering only the topmost results returned by the system. This measure is called ''precision at n'' or ''P@n''.
Note that the meaning and usage of "precision" in the field of Information Retrieval differs from the definition of [[accuracy and precision]] within other branches of science and technology.
=== Recall ===
Recall is the fraction of the documents that are relevant to the query that are successfully retrieved.
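:<math>\mbox{recall}=\frac{|\{\mbox{relevant documents}\}\cap\{\mbox{retrieved documents}\}|}{|\{\mbox{relevant documents}\}|}</math>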
In binary classification, recall is called [[sensitivity (tests)|sensitivity]]. So it can be looked at as ''the probability that a relevant document is retrieved by the query''.
It is trivial to achieve recall of 100% by returning all documents in response to any query. Recall alone is therefore not enough; one also needs to measure the number of non-relevant documents retrieved, for example by computing the precision.
=== Fall-Out ===
The proportion of non-relevant documents that are retrieved, out of all non-relevant documents available:
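:<math>\mbox{fall-out}=\frac{|\{\mbox{non-relevant documents}\}\cap\{\mbox{retrieved documents}\}|}{|\{\mbox{non-relevant documents}\}|}</math>
In binary classification, fall-out is closely related to [[specificity (tests)|specificity]]. More precisely: <math>\mbox{fall-out}=1-\mbox{specificity}</math>. It can be looked at as ''the probability that a non-relevant document is retrieved by the query''.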
It is trivial to achieve fall-out of 0% by returning zero documents in response to any query.
=== F-measure ===
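The weighted [[harmonic mean]] of precision and recall, the traditional F-measure or balanced F-score is:
:<math>F = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}</math>
This is also known as the <math>F_1</math> measure, because recall and precision are evenly weighted.
The general formula for non-negative real <math>\beta</math> is:
:<math>F_\beta = \frac{(1 + \beta^2) \cdot \mathrm{precision} \cdot \mathrm{recall}}{\beta^2 \cdot \mathrm{precision} + \mathrm{recall}}</math>
Two other commonly used F measures are the <math>F_2</math> measure, which weights recall twice as much as precision, and the <math>F_{0.5}</math> measure, which weights precision twice as much as recall.
The F-measure was derived by van Rijsbergen (1979) so that <math>F_\beta</math> "measures the effectiveness of retrieval with respect to a user who attaches <math>\beta</math> times as much importance to recall as precision". It is based on van Rijsbergen's effectiveness measure <math>E = 1 - \frac{1}{\frac{\alpha}{P} + \frac{1-\alpha}{R}}</math>, where ''P'' is precision and ''R'' is recall. Their relationship is <math>F_\beta = 1 - E</math> where <math>\alpha = \frac{1}{1 + \beta^2}</math>.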
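The set-based measures in this section can be computed directly from the sets of retrieved and relevant documents. The sketch below is one possible arrangement, not a reference implementation: the function name `evaluate`, the `collection_size` parameter, and the example document ids are assumptions. It takes precision as hits over retrieved, recall as hits over relevant, fall-out as retrieved non-relevant over all non-relevant, and the F-measure as (1 + β²)·P·R / (β²·P + R).

```python
def evaluate(retrieved, relevant, collection_size, beta=1.0):
    """Precision, recall, fall-out and F_beta for one query.

    Assumes every retrieved and relevant id belongs to a collection of
    `collection_size` documents, with binary ground-truth relevance.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)

    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0

    non_relevant = collection_size - len(relevant)
    fall_out = (len(retrieved) - hits) / non_relevant if non_relevant else 0.0

    denom = beta ** 2 * precision + recall
    f_beta = (1 + beta ** 2) * precision * recall / denom if denom else 0.0

    return {"precision": precision, "recall": recall,
            "fall_out": fall_out, "f_beta": f_beta}


# Hypothetical query: 4 documents retrieved, 5 relevant, 20 in the collection.
scores = evaluate(retrieved={1, 2, 3, 4}, relevant={2, 4, 6, 8, 10},
                  collection_size=20)
print(scores)
```

Returning every document drives recall to 1 but also drives fall-out to 1, which is why the measures are usually reported together.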
=== Average precision ===
Precision and recall are based on the whole list of documents returned by the system. Average precision emphasizes returning more relevant documents earlier. It is the average of the precisions computed after truncating the list after each of the relevant documents in turn:
:<math>\operatorname{AveP} = \frac{\sum_{r=1}^N (P(r) \times \mathrm{rel}(r))}{\mbox{number of relevant documents}}</math>
where ''r'' is the rank, ''N'' the number retrieved, ''rel()'' a binary function on the relevance of a given rank, and ''P()'' precision at a given cut-off rank.
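A minimal sketch of this computation follows, assuming binary relevance and dividing by the total number of relevant documents; the function name and the example ranking are illustrative only. At each rank ''r'' that holds a relevant document, it adds the precision at that cut-off, ''P(r)''.

```python
def average_precision(ranked_ids, relevant):
    """Average of P(r) over the ranks r at which a relevant document appears."""
    relevant = set(relevant)
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:      # rel(r) = 1
            hits += 1
            total += hits / rank    # P(r): precision at cut-off rank r
    return total / len(relevant) if relevant else 0.0


# Hypothetical ranking with relevant documents at ranks 1, 3 and 5.
print(average_precision(["a", "b", "c", "d", "e"], {"a", "c", "e"}))
```

Moving the same three relevant documents to ranks 1, 2 and 3 would raise the value to 1, which is the sense in which the measure rewards returning relevant documents earlier.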
== Model types ==
[[Image:Information-Retrieval-Models.png|thumb|500px|categorization of IR-models (translated from [http://de.wikipedia.org/wiki/Informationsrückgewinnung#Klassifikation_von_Modellen_zur_Repr.C3.A4sentation_nat.C3.BCrlichsprachlicher_Dokumente German entry], original source [http://www.logos-verlag.de/cgi-bin/engbuchmid?isbn=0514&lng=eng&id= Dominik Kuropka])]]
For information retrieval to be efficient, the documents are typically transformed into a suitable representation. There are several representations. The picture on the right illustrates the relationship of some common models. In the picture, the models are categorized according to two dimensions: the mathematical basis and the properties of the model.
=== First dimension: mathematical basis ===
* ''Set-theoretic models'' represent documents as sets of words or phrases. Similarities are usually derived from set-theoretic operations on those sets. Common models are:
** [[Standard Boolean model]]
** [[Extended Boolean model]]
** [[Fuzzy retrieval]]
* ''Algebraic models'' represent documents and queries usually as vectors, matrices or tuples. The similarity of the query vector and document vector is represented as a scalar value.
** [[Vector space model]]
** [[Generalized vector space model]]
** Topic-based vector space model (literature: [http://www.kuropka.net/files/TVSM.pdf], [http://www.logos-verlag.de/cgi-bin/engbuchmid?isbn=0514&lng=eng&id=])
** [[Extended Boolean model]]
** Enhanced topic-based vector space model (literature: [http://kuropka.net/files/HPI_Evaluation_of_eTVSM.pdf], [http://www.logos-verlag.de/cgi-bin/engbuchmid?isbn=0514&lng=eng&id=])
** Latent semantic indexing aka [[latent semantic analysis]]
* ''Probabilistic models'' treat the process of document retrieval as a probabilistic inference. Similarities are computed as probabilities that a document is relevant for a given query. Probabilistic theorems like the [[Bayes' theorem]] are often used in these models.
** [[Binary independence retrieval]]
** [[Probabilistic relevance model (BM25)]]
** Uncertain inference
** [[Language model]]s
** [[Divergence-from-randomness model]]
** [[Latent Dirichlet allocation]]
=== Second dimension: properties of the model ===
* ''Models without term-interdependencies'' treat different terms/words as independent. This fact is usually represented in vector space models by the [[orthogonality]] assumption of term vectors or in probabilistic models by an [[independency]] assumption for term variables.
* ''Models with immanent term interdependencies'' allow a representation of interdependencies between terms. However the degree of the interdependency between two terms is defined by the model itself. It is usually directly or indirectly derived (e.g. by [[dimension reduction|dimensional reduction]]) from the [[co-occurrence]] of those terms in the whole set of documents.
* ''Models with transcendent term interdependencies'' allow a representation of interdependencies between terms, but they do not specify how the interdependency between two terms is defined. They rely on an external source for the degree of interdependency between two terms (for example a human or sophisticated algorithms).
== Major figures ==
* [[Gerard Salton]]
* [[Hans Peter Luhn]]
* [http://ciir.cs.umass.edu/personnel/croft.html W. Bruce Croft]
* [[Karen Spärck Jones]]
* [[C. J. van Rijsbergen]]
* [http://www.soi.city.ac.uk/~ser/homepage.html Stephen E. Robertson]
== Awards in the field ==
* [[Tony Kent Strix award]]
* [[Gerard Salton Award]]
Information theory
'''Information theory''' is a branch of [[applied mathematics]] and [[electrical engineering]] involving the quantification of [[information]]. Historically, information theory was developed to find fundamental limits on compressing and reliably [[communication|communicating]] data. Since its inception it has broadened to find applications in many other areas, including [[statistical inference]], [[natural language processing]], [[cryptography]] generally, [[networks]] other than communication networks -- as in [[neurobiology]], the evolution and function of molecular codes, model selection in ecology, thermal physics, [[quantum computing]], plagiarism detection and other forms of [[data analysis]].
A key measure of information in the theory is known as [[information entropy]], which is usually expressed by the average number of bits needed for storage or communication. Intuitively, entropy quantifies the uncertainty involved when encountering a [[random variable]]. For example, a fair coin flip (2 equally likely outcomes) will have less entropy than a roll of a die (6 equally likely outcomes).
Applications of fundamental topics of information theory include [[lossless data compression]] (e.g. [[ZIP (file format)|ZIP files]]), [[lossy data compression]] (e.g. [[MP3]]s), and [[channel capacity|channel coding]] (e.g. for [[DSL]] lines). The field is at the intersection of [[mathematics]], [[statistics]], [[computer science]], [[physics]], [[neurobiology]], and [[electrical engineering]]. Its impact has been crucial to the success of the [[Voyager program|Voyager]] missions to deep space, the invention of the CD, the feasibility of mobile phones, the development of the [[Internet]], the study of [[linguistics]] and of human perception, the understanding of [[black hole]]s, and numerous other fields. Important sub-fields of information theory are source coding, channel coding, algorithmic complexity theory, algorithmic information theory, and measures of information.
==Overview==
The main concepts of information theory can be grasped by considering the most widespread means of human communication: language. Two important aspects of a good language are as follows: First, the most common words (e.g., "a", "the", "I") should be shorter than less common words (e.g., "benefit", "generation", "mediocre"), so that sentences will not be too long. Such a tradeoff in word length is analogous to [[data compression]] and is the essential aspect of [[source coding]]. Second, if part of a sentence is unheard or misheard due to noise (e.g., a passing car), the listener should still be able to glean the meaning of the underlying message. Such robustness is as essential for an electronic communication system as it is for a language; properly building such robustness into communications is done by [[Channel capacity|channel coding]]. Source coding and channel coding are the fundamental concerns of information theory.
Note that these concerns have nothing to do with the ''importance'' of messages. For example, a platitude such as "Thank you; come again" takes about as long to say or write as the urgent plea, "Call an ambulance!" while clearly the latter is more important and more meaningful. Information theory, however, does not consider message importance or meaning, as these are matters of the quality of data rather than the quantity and readability of data, the latter of which is determined solely by probabilities.
Information theory is generally considered to have been founded in 1948 by [[Claude Elwood Shannon|Claude Shannon]] in his seminal work, "[[A Mathematical Theory of Communication]]." The central paradigm of classical information theory is the engineering problem of the transmission of information over a noisy channel. The most fundamental results of this theory are Shannon's [[source coding theorem]], which establishes that, on average, the number of ''bits'' needed to represent the result of an uncertain event is given by its [[information entropy|entropy]]; and Shannon's [[noisy-channel coding theorem]], which states that ''reliable'' communication is possible over ''noisy'' channels provided that the rate of communication is below a certain threshold called the channel capacity. The channel capacity can be approached in practice by using appropriate encoding and decoding systems.
Information theory is closely associated with a collection of pure and applied disciplines that have been investigated and reduced to engineering practice under a variety of rubrics throughout the world over the past half century or more: [[adaptive system]]s, [[anticipatory system]]s, [[artificial intelligence]], [[complex system]]s, [[complexity science]], [[cybernetics]], [[informatics]], [[machine learning]], along with [[systems science]]s of many descriptions. Information theory is a broad and deep mathematical theory, with equally broad and deep applications, amongst which is the vital field of [[coding theory]].
Coding theory is concerned with finding explicit methods, called ''codes'', of increasing the efficiency and reducing the net error rate of data communication over a noisy channel to near the limit that Shannon proved is the maximum possible for that channel. These codes can be roughly subdivided into [[data compression]] (source coding) and [[error-correction]] (channel coding) techniques. In the latter case, it took many years to find the methods Shannon's work proved were possible. A third class of information theory codes are cryptographic algorithms (both [[code (cryptography)|code]]s and [[cipher]]s). Concepts, methods and results from coding theory and information theory are widely used in [[cryptography]] and [[cryptanalysis]]. ''See the article [[ban (information)]] for a historical application.''
Information theory is also used in [[information retrieval]], [[intelligence (information gathering)|intelligence gathering]], [[gambling]], [[statistics]], and even in [[musical composition]].
==Historical background==
The landmark event that established the discipline of information theory, and brought it to immediate worldwide attention, was the publication of [[Claude E. Shannon]]'s classic paper "[[A Mathematical Theory of Communication]]" in the ''[[Bell System Technical Journal]]'' in July and October of 1948.
Prior to this paper, limited information theoretic ideas had been developed at Bell Labs, all implicitly assuming events of equal probability. [[Harry Nyquist]]'s 1924 paper, ''Certain Factors Affecting Telegraph Speed,'' contains a theoretical section quantifying "intelligence" and the "line speed" at which it can be transmitted by a communication system, giving the relation <math>W = K \log m</math>, where ''W'' is the speed of transmission of intelligence, ''m'' is the number of different voltage levels to choose from at each time step, and ''K'' is a constant. [[Ralph Hartley]]'s 1928 paper, ''Transmission of Information,'' uses the word ''information'' as a measurable quantity, reflecting the receiver's ability to distinguish that one sequence of symbols from any other, thus quantifying information as <math>H = \log S^n = n \log S</math>, where ''S'' was the number of possible symbols, and ''n'' the number of symbols in a transmission. The natural unit of information was therefore the decimal digit, much later renamed the [[ban (information)|hartley]] in his honour as a unit or scale or measure of information. [[Alan Turing]] in 1940 used similar ideas as part of the statistical analysis of the breaking of the German Second World War [[Cryptanalysis of the Enigma|Enigma]] ciphers.
Much of the mathematics behind information theory with events of different probabilities was developed for the field of [[thermodynamics]] by [[Ludwig Boltzmann]] and [[J. Willard Gibbs]]. Connections between information-theoretic entropy and thermodynamic entropy, including the important contributions by [[Rolf Landauer]] in the 1960s, are explored in ''[[Entropy in thermodynamics and information theory]]''.
In Shannon's revolutionary and groundbreaking paper, the work for which had been substantially completed at Bell Labs by the end of 1944, Shannon for the first time introduced the qualitative and quantitative model of communication as a statistical process underlying information theory, opening with the assertion that
:"The fundamental problem of communication is that of reproducing at one point, either exactly or approximately, a message selected at another point."
With it came the ideas of
* the [[information entropy]] and [[redundancy (information theory)|redundancy]] of a source, and its relevance through the [[source coding theorem]];
* the [[mutual information]], and the [[channel capacity]] of a noisy channel, including the promise of perfect loss-free communication given by the [[noisy-channel coding theorem]];
* the practical result of the [[Shannon–Hartley law]] for the channel capacity of a Gaussian channel; and of course
* the [[bit]]—a new way of seeing the most fundamental unit of information
==Ways of measuring information==
Information theory is based on [[probability theory]] and [[statistics]]. The most important quantities of information are [[Information entropy|entropy]], the information in a [[random variable]], and [[mutual information]], the amount of information in common between two random variables. The former quantity indicates how easily message data can be [[data compression|compressed]] while the latter can be used to find the communication rate across a [[Channel (communications)|channel]].
The choice of logarithmic base in the following formulae determines the [[units of measurement|unit]] of [[information entropy]] that is used. The most common unit of information is the [[bit]], based on the [[binary logarithm]]. Other units include the [[nat (information)|nat]], which is based on the [[natural logarithm]], and the [[deciban|hartley]], which is based on the [[common logarithm]].
In what follows, an expression of the form <math>p \log p</math> is considered by convention to be equal to zero whenever <math>p = 0</math>. This is justified because <math>\lim_{p \rightarrow 0+} p \log p = 0</math> for any logarithmic base.
===Entropy===
The '''[[information entropy|entropy]]''', <math>H</math>, of a discrete random variable <math>X</math> is a measure of the amount of ''uncertainty'' associated with the value of <math>X</math>.
Suppose one transmits 1000 bits (0s and 1s). If these bits are known ahead of transmission (to be a certain value with absolute probability), logic dictates that no information has been transmitted. If, however, each is equally and independently likely to be 0 or 1, 1000 bits (in the information theoretic sense) have been transmitted. Between these two extremes, information can be quantified as follows. If <math>\mathbb{X}</math> is the set of all messages <math>x</math> that <math>X</math> could be, and <math>p(x)</math> is the probability that <math>X</math> takes the value <math>x</math>, then the entropy of <math>X</math> is defined:
:<math>H(X) = \mathbb{E}_{X} [I(x)] = -\sum_{x \in \mathbb{X}} p(x) \log p(x).</math>
(Here, <math>I(x) = -\log p(x)</math> is the [[self-information]], which is the entropy contribution of an individual message.) An important property of entropy is that it is maximized when all the messages in the message space are equiprobable (that is, most unpredictable), in which case <math>H(X) = \log |\mathbb{X}|</math>.
The special case of information entropy for a random variable with two outcomes is the '''[[binary entropy function]]''', here written with the logarithm to base 2 so that it is measured in bits:
:<math>H_\mbox{b}(p) = - p \log_2 p - (1-p)\log_2 (1-p).</math>
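The entropy of a discrete distribution is the sum of −''p'' log ''p'' over its outcomes, with zero-probability outcomes contributing nothing. The short sketch below, whose function names are chosen for this example only, evaluates it in bits for a fair coin, a fair die, and a certain outcome, and implements the two-outcome special case as `binary_entropy`.

```python
import math


def entropy(probs, base=2):
    """Shannon entropy of a discrete distribution; terms with p = 0 count as 0."""
    return -sum(p * math.log(p) / math.log(base) for p in probs if p > 0)


def binary_entropy(p):
    """Entropy in bits of a two-outcome variable with probabilities p and 1 - p."""
    return entropy([p, 1 - p])


coin = [0.5, 0.5]
die = [1 / 6] * 6

print("fair coin:", entropy(coin), "bits")
print("fair die: ", entropy(die), "bits")
print("biased coin (p = 0.1):", binary_entropy(0.1), "bits")
```

The die has higher entropy than the coin because its six equiprobable outcomes are harder to predict than two, and the biased coin has less than one bit because its outcome is partly predictable.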
===Joint entropy===
The '''[[joint entropy]]''' of two discrete random variables <math>X</math> and <math>Y</math> is merely the entropy of their pairing: <math>(X, Y)</math>. This implies that if <math>X</math> and <math>Y</math> are [[statistical independence|independent]], then their joint entropy is the sum of their individual entropies.
For example, if <math>(X,Y)</math> represents the position of a [[chess]] piece (<math>X</math> the row and <math>Y</math> the column), then the joint entropy of the row of the piece and the column of the piece will be the entropy of the position of the piece.
:<math>H(X, Y) = \mathbb{E}_{X,Y} [-\log p(x,y)] = - \sum_{x, y} p(x, y) \log p(x, y)</math>
Despite similar notation, joint entropy should not be confused with '''[[cross entropy]]'''.
===Conditional entropy (equivocation)===
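The '''[[conditional entropy]]''' or '''conditional uncertainty''' of <math>X</math> given random variable <math>Y</math> (also called the '''equivocation''' of <math>X</math> about <math>Y</math>) is the average conditional entropy over <math>Y</math>:
:<math>H(X|Y) = \mathbb{E}_Y [H(X|y)] = -\sum_{y \in Y} p(y) \sum_{x \in X} p(x|y) \log p(x|y) = -\sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(y)}.</math>
Because entropy can be conditioned on a random variable or on that random variable being a certain value, care should be taken not to confuse these two definitions of conditional entropy, the former of which is in more common use. A basic property of this form of conditional entropy is that:
:<math>H(X|Y) = H(X,Y) - H(Y).</math>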
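The relation between joint and conditional entropy, ''H''(''X''|''Y'') = ''H''(''X'',''Y'') − ''H''(''Y''), can be checked numerically. In the sketch below a joint distribution is represented as a dictionary mapping pairs (''x'', ''y'') to probabilities; this representation, the helper names, and the example distributions are assumptions made for illustration.

```python
import math
from collections import defaultdict


def entropy(dist):
    """Entropy in bits of a distribution given as {outcome: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def marginal(joint, axis):
    """Marginal of joint[(x, y)] over the other variable (axis 0 = X, 1 = Y)."""
    out = defaultdict(float)
    for pair, p in joint.items():
        out[pair[axis]] += p
    return dict(out)


def conditional_entropy(joint):
    """H(X|Y) from the definition: -sum p(x,y) log2( p(x,y) / p(y) )."""
    p_y = marginal(joint, 1)
    return -sum(p * math.log2(p / p_y[y])
                for (x, y), p in joint.items() if p > 0)


# A small dependent pair (hypothetical numbers).
joint = {(0, 0): 0.5, (0, 1): 0.25, (1, 1): 0.25}
print("H(X,Y)        =", entropy(joint))
print("H(X|Y)        =", conditional_entropy(joint))
print("H(X,Y) - H(Y) =", entropy(joint) - entropy(marginal(joint, 1)))

# Chess example: a piece placed uniformly on an 8x8 board has independent
# row and column, so the joint entropy is the sum of the two marginals.
board = {(row, col): 1 / 64 for row in range(8) for col in range(8)}
print("position:", entropy(board), "bits; row + column:",
      entropy(marginal(board, 0)) + entropy(marginal(board, 1)), "bits")
```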
===Mutual information (transinformation)===
'''[[Mutual information]]''' measures the amount of information that can be obtained about one random variable by observing another. It is important in communication where it can be used to maximize the amount of information shared between sent and received signals. The mutual information of <math>X</math> relative to <math>Y</math> is given by:
:<math>I(X;Y) = \mathbb{E}_{X,Y} [SI(x,y)] = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)\, p(y)}</math>
where <math>SI</math> (''S''pecific mutual ''I''nformation) is the [[pointwise mutual information]].
A basic property of the mutual information is that
:<math>I(X;Y) = H(X) - H(X|Y).</math>
That is, knowing ''Y'', we can save an average of <math>I(X;Y)</math> bits in encoding ''X'' compared to not knowing ''Y''.
Mutual information is [[symmetric function|symmetric]]:
:<math>I(X;Y) = I(Y;X) = H(X) + H(Y) - H(X,Y).</math>
Mutual information can be expressed as the average [[Kullback–Leibler divergence]] (information gain) of the [[posterior probability|posterior probability distribution]] of ''X'' given the value of ''Y'' to the [[prior probability|prior distribution]] on ''X'':
:<math>I(X;Y) = \mathbb{E}_{p(y)} [D_{\mathrm{KL}}( p(X|Y=y) \| p(X) )].</math>
In other words, this is a measure of how much, on the average, the probability distribution on ''X'' will change if we are given the value of ''Y''. This is often recalculated as the divergence from the product of the marginal distributions to the actual joint distribution:
:<math>I(X;Y) = D_{\mathrm{KL}}(p(X,Y) \| p(X)p(Y)).</math>
Mutual information is closely related to the [[likelihood-ratio test|log-likelihood ratio test]] in the context of contingency tables and the [[multinomial distribution]] and to [[Pearson's chi-square test|Pearson's χ2 test]]: mutual information can be considered a statistic for assessing independence between a pair of variables, and has a well-specified asymptotic distribution.
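As a small numerical check, the sketch below computes mutual information in two ways: directly as the sum of ''p''(''x'',''y'') log [''p''(''x'',''y'') / (''p''(''x'')''p''(''y''))], and through the identity ''I''(''X'';''Y'') = ''H''(''X'') + ''H''(''Y'') − ''H''(''X'',''Y''). The dictionary representation of the joint distribution and the example values are assumptions chosen for illustration.

```python
import math
from collections import defaultdict
from itertools import product


def marginals(joint):
    """Return (p_x, p_y) for a joint distribution joint[(x, y)] = p(x, y)."""
    p_x, p_y = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        p_x[x] += p
        p_y[y] += p
    return p_x, p_y


def entropy(dist):
    """Entropy in bits of {outcome: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def mutual_information(joint):
    """I(X;Y) in bits, summed over the support of the joint distribution."""
    p_x, p_y = marginals(joint)
    return sum(p * math.log2(p / (p_x[x] * p_y[y]))
               for (x, y), p in joint.items() if p > 0)


dependent = {(0, 0): 0.5, (0, 1): 0.25, (1, 1): 0.25}
independent = {pair: 0.25 for pair in product((0, 1), repeat=2)}

p_x, p_y = marginals(dependent)
print("I(X;Y) by definition:", mutual_information(dependent))
print("H(X) + H(Y) - H(X,Y):", entropy(p_x) + entropy(p_y) - entropy(dependent))
print("independent pair:    ", mutual_information(independent))
```

For the independent pair the joint distribution equals the product of its marginals, so every logarithm is zero and the mutual information vanishes.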
===Kullback–Leibler divergence (information gain)===
The '''[[Kullback–Leibler divergence]]''' (or '''information divergence''', '''information gain''', or '''relative entropy''') is a way of comparing two distributions: a "true" [[probability distribution]] ''p(X)'', and an arbitrary probability distribution ''q(X)''. If we compress data in a manner that assumes ''q(X)'' is the distribution underlying some data, when, in reality, ''p(X)'' is the correct distribution, the Kullback–Leibler divergence is the average number of additional bits per datum necessary for compression. It is thus defined
:<math>D_{\mathrm{KL}}(p(X) \| q(X)) = \sum_{x \in X} -p(x) \log {q(x)} \, - \, \left( -p(x) \log {p(x)}\right) = \sum_{x \in X} p(x) \log \frac{p(x)}{q(x)}.</math>
Although it is sometimes used as a 'distance metric', it is not a true [[Metric (mathematics)|metric]] since it is not symmetric and does not satisfy the [[triangle inequality]] (making it a semi-quasimetric).
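The following sketch evaluates the divergence as the sum of ''p''(''x'') log₂ [''p''(''x'') / ''q''(''x'')] for two example distributions and shows both the "additional bits" reading and the lack of symmetry. It assumes that ''q''(''x'') > 0 wherever ''p''(''x'') > 0; the function names and the distributions are illustrative.

```python
import math


def kl_divergence(p, q):
    """D_KL(p || q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def cross_entropy(p, q):
    """Average bits per datum when data from p is coded with a code built for q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)


def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


p = [0.5, 0.5]   # "true" distribution
q = [0.9, 0.1]   # assumed distribution

print("D(p||q) =", kl_divergence(p, q))
print("D(q||p) =", kl_divergence(q, p))
print("extra bits per datum =", cross_entropy(p, q) - entropy(p))
```

The two directions give different values, which is why the divergence is not a metric, while the last line reproduces D(p‖q) as the coding overhead.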
===Other quantities===
Other important information theoretic quantities include [[Rényi entropy]] (a generalization of entropy) and [[differential entropy]] (a generalization of quantities of information to continuous distributions.)
==Coding theory==
[[Coding theory]] is one of the most important and direct applications of information theory. It can be subdivided into [[data compression|source coding]] theory and [[error correction|channel coding]] theory. Using a statistical description for data, information theory quantifies the number of bits needed to describe the data, which is the information entropy of the source.
* Data compression (source coding): There are two formulations for the compression problem:
#[[lossless data compression]]: the data must be reconstructed exactly;
#[[lossy data compression]]: allocates bits needed to reconstruct the data, within a specified fidelity level measured by a distortion function. This subset of Information theory is called [[rate–distortion theory]].
* Error-correcting codes (channel coding): While data compression removes as much [[redundancy (information theory)|redundancy]] as possible, an error correcting code adds just the right kind of redundancy (i.e. [[error correction]]) needed to transmit the data efficiently and faithfully across a noisy channel.
This division of coding theory into compression and transmission is justified by the information transmission theorems, or source–channel separation theorems that justify the use of bits as the universal currency for information in many contexts. However, these theorems only hold in the situation where one transmitting user wishes to communicate to one receiving user. In scenarios with more than one transmitter (the multiple-access channel), more than one receiver (the [[broadcast channel]]) or intermediary "helpers" (the [[relay channel]]), or more general [[computer network|networks]], compression followed by transmission may no longer be optimal. [[Network information theory]] refers to these multi-agent communication models.
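To make the source-coding side concrete, the sketch below builds a lossless prefix code with Huffman's algorithm and compares its average codeword length with the entropy of the source. Huffman coding is not discussed in the text above; it is used here only as a familiar example of a lossless code, and the symbol probabilities are made up.

```python
import heapq
import math


def huffman_code_lengths(probs):
    """Codeword length per symbol for a binary Huffman code."""
    # Heap items: (subtree probability, tie-breaker, symbol indices in subtree).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    tie = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)   # two least probable subtrees
        p2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:                 # merging adds one bit to each symbol
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, tie, s1 + s2))
        tie += 1
    return lengths


def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)


probs = [0.5, 0.25, 0.125, 0.125]
lengths = huffman_code_lengths(probs)
average = sum(p * n for p, n in zip(probs, lengths))
print("code lengths:", lengths)
print("average length:", average, "bits; entropy:", entropy(probs), "bits")
```

For this source the probabilities are powers of 1/2, so the average length equals the entropy; for other sources a Huffman code stays within one bit of it.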
===Source theory===
Any process that generates successive messages can be considered a '''[[Communication source|source]]''' of information. A memoryless source is one in which each message is an [[Independent identically-distributed random variables|independent identically-distributed random variable]], whereas the properties of [[ergodic theory|ergodicity]] and [[stationary process|stationarity]] impose more general constraints. All such sources are [[stochastic process|stochastic]]. These terms are well studied in their own right outside information theory.
====Rate====
Information [[Entropy rate|'''rate''']] is the average entropy per symbol. For memoryless sources, this is merely the entropy of each symbol, while, in the case of a stationary stochastic process, it is
:<math>r = \lim_{n \to \infty} H(X_n|X_{n-1},X_{n-2},X_{n-3}, \ldots);</math>
that is, the conditional entropy of a symbol given all the previous symbols generated. For the more general case of a process that is not necessarily stationary, the ''average rate'' is
:<math>r = \lim_{n \to \infty} \frac{1}{n} H(X_1, X_2, \dots X_n);</math>
that is, the limit of the joint entropy per symbol. For stationary sources, these two expressions give the same result.
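As a worked example, consider a stationary two-state Markov source, which is an assumption introduced here and not a case treated in the text. For such a source the conditional entropy of a symbol given all previous symbols depends only on the preceding symbol, so the rate is the stationary average of the row entropies of the transition matrix. The sketch compares that value with the joint entropy per symbol, computed by brute force over all sequences of length ''n''.

```python
import itertools
import math


def row_entropy(row):
    """Entropy in bits of one probability vector."""
    return -sum(p * math.log2(p) for p in row if p > 0)


def stationary_two_state(a, b):
    """Stationary distribution of the chain [[1 - a, a], [b, 1 - b]]."""
    return [b / (a + b), a / (a + b)]


def entropy_rate(P, pi):
    """Conditional entropy of the next symbol given the current one."""
    return sum(pi[i] * row_entropy(P[i]) for i in range(len(pi)))


def joint_entropy_per_symbol(P, pi, n):
    """(1/n) H(X_1, ..., X_n), enumerating every length-n sequence."""
    total = 0.0
    for seq in itertools.product(range(len(pi)), repeat=n):
        p = pi[seq[0]]
        for s, t in zip(seq, seq[1:]):
            p *= P[s][t]
        if p > 0:
            total -= p * math.log2(p)
    return total / n


P = [[0.9, 0.1], [0.3, 0.7]]          # hypothetical transition matrix
pi = stationary_two_state(0.1, 0.3)

print("rate from conditional entropy:", entropy_rate(P, pi))
for n in (1, 4, 12):
    print(f"(1/{n}) H(X_1..X_{n}) =", joint_entropy_per_symbol(P, pi, n))
```

The per-symbol joint entropy starts at the entropy of a single symbol and decreases toward the conditional-entropy rate as ''n'' grows, consistent with the two expressions agreeing for stationary sources.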
It is common in information theory to speak of the "rate" or "entropy" of a language. This is appropriate, for example, when the source of information is English prose. The rate of a source of information is related to its [[redundancy (information theory)|redundancy]] and how well it can be [[data compression|compressed]], the subject of '''source coding'''.
===Channel capacity===
Communication over a channel, such as an [[ethernet]] wire, is the primary motivation of information theory. As anyone who has ever used a telephone (mobile or landline) knows, however, such channels often fail to produce exact reconstruction of a signal; noise, periods of silence, and other forms of signal corruption often degrade quality. How much information can one hope to communicate over a noisy (or otherwise imperfect) channel?
Consider the communications process over a discrete channel. In a simple model of the process, a transmitter sends a message ''X'' through a noisy channel and a receiver observes the output ''Y''.
Here ''X'' represents the space of messages transmitted, and ''Y'' the space of messages received during a unit time over our channel. Let <math>p(y|x)</math> be the [[conditional probability]] distribution function of ''Y'' given ''X''. We will consider <math>p(y|x)</math> to be an inherent fixed property of our communications channel (representing the nature of the '''[[Signal noise|noise]]''' of our channel). Then the joint distribution of ''X'' and ''Y'' is completely determined by our channel and by our choice of <math>f(x)</math>, the marginal distribution of messages we choose to send over the channel. Under these constraints, we would like to maximize the rate of information, or the '''[[Signal (electrical engineering)|signal]]''', we can communicate over the channel. The appropriate measure for this is the [[mutual information]], and this maximum mutual information is called the '''[[channel capacity]]''' and is given by:
:<math>C = \max_{f} I(X;Y).</math>
This capacity has the following property related to communicating at information rate ''R'' (where ''R'' is usually bits per symbol). For any information rate ''R < C'' and coding error ε > 0, for large enough ''N'', there exists a code of length ''N'' and rate ≥ R and a decoding algorithm, such that the maximal probability of block error is ≤ ε; that is, it is always possible to transmit with arbitrarily small block error. In addition, for any rate ''R > C'', it is impossible to transmit with arbitrarily small block error.
'''[[Channel code|Channel coding]]''' is concerned with finding such nearly optimal [[error detection and correction|codes]] that can be used to transmit data over a noisy channel with a small coding error at a rate near the channel capacity.
====Channel capacity of particular model channels====
* A continuous-time analog communications channel subject to Gaussian noise — see [[Shannon–Hartley theorem]].
* A [[binary symmetric channel]] (BSC) with crossover probability ''p'' is a binary input, binary output channel that flips the input bit with probability ''p''. The BSC has a capacity of <math>1 - H_\mbox{b}(p)</math> bits per channel use, where <math>H_\mbox{b}</math> is the [[binary entropy function]].
* A binary erasure channel (BEC) with erasure probability '' p '' is a binary input, ternary output channel. The possible channel outputs are ''0'', ''1'', and a third symbol 'e' called an erasure. The erasure represents complete loss of information about an input bit. The capacity of the BEC is ''1 - p'' bits per channel use.
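The two closed-form capacities above can be checked against the definition of capacity as the maximum mutual information over input distributions. The sketch below does this with a simple grid search over the probability of sending a 0; the grid search, the function names, and the example error probabilities are choices made for this illustration, not a general capacity algorithm.

```python
import math


def mutual_information(p_x, channel):
    """I(X;Y) in bits for input distribution p_x and channel[x][y] = p(y|x)."""
    n_out = len(channel[0])
    p_y = [sum(p_x[x] * channel[x][y] for x in range(len(p_x)))
           for y in range(n_out)]
    total = 0.0
    for x, px in enumerate(p_x):
        for y in range(n_out):
            pxy = px * channel[x][y]
            if pxy > 0:
                # p(x,y) / (p(x) p(y)) simplifies to p(y|x) / p(y)
                total += pxy * math.log2(channel[x][y] / p_y[y])
    return total


def capacity_binary_input(channel, steps=1000):
    """Approximate capacity by a grid search over q = P(X = 0)."""
    return max(mutual_information([i / steps, 1 - i / steps], channel)
               for i in range(steps + 1))


def binary_entropy(p):
    return -sum(t * math.log2(t) for t in (p, 1 - p) if t > 0)


p = 0.1
bsc = [[1 - p, p], [p, 1 - p]]                 # outputs: 0, 1
e = 0.2
bec = [[1 - e, 0.0, e], [0.0, 1 - e, e]]       # outputs: 0, 1, erasure

print("BSC:", capacity_binary_input(bsc), "vs 1 - H_b(p) =", 1 - binary_entropy(p))
print("BEC:", capacity_binary_input(bec), "vs 1 - p =", 1 - e)
```

In both channels the maximum is attained by the uniform input distribution, which lies on the search grid, so the numerical values agree with the formulas up to rounding.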
==Applications to other fields==
===Intelligence uses and secrecy applications===
Information theoretic concepts apply to [[cryptography]] and [[cryptanalysis]]. [[Turing]]'s information unit, the [[Ban (information)|ban]], was used in the [[Ultra]] project, breaking the German [[Enigma machine]] code and hastening the [[Victory in Europe Day|end of WWII in Europe]]. Shannon himself defined an important concept now called the [[unicity distance]]. Based on the [[redundancy (information theory)|redundancy]] of the [[plaintext]], it attempts to give a minimum amount of [[ciphertext]] necessary to ensure unique decipherability.
Information theory leads us to believe it is much more difficult to keep secrets than it might first appear. A [[brute force attack]] can break systems based on [[public-key cryptography|asymmetric key algorithms]] or on most commonly used methods of [[symmetric-key algorithm|symmetric key algorithms]] (sometimes called secret key algorithms), such as [[block cipher]]s. The security of all such methods currently comes from the assumption that no known attack can break them in a practical amount of time. [[Information theoretic security]] refers to methods such as the [[one-time pad]] that are not vulnerable to such brute force attacks. In such cases, the positive conditional [[mutual information]] between the [[plaintext]] and [[ciphertext]] (conditioned on the [[key (cryptography)| key]]) can ensure proper transmission, while the unconditional mutual information between the plaintext and ciphertext remains zero, resulting in absolutely secure communications. In other words, an eavesdropper would not be able to improve his or her guess of the plaintext by gaining knowledge of the ciphertext but not of the key. However, as in any other cryptographic system, care must be used to correctly apply even information-theoretically secure methods; the [[Venona project]] was able to crack the one-time pads of the [[Soviet Union]] due to their improper reuse of key material.
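A one-time pad can be sketched in a few lines as a bytewise XOR with a random key as long as the message. The example below is a toy illustration only (the messages and helper names are made up, and it is not a vetted cryptographic implementation); it also shows why reusing key material is fatal: XORing two ciphertexts made with the same key cancels the key and leaves the XOR of the two plaintexts.

```python
import secrets


def xor_bytes(a, b):
    """Bytewise XOR of two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


def otp_encrypt(plaintext):
    """Encrypt with a fresh random key as long as the message."""
    key = secrets.token_bytes(len(plaintext))
    return key, xor_bytes(plaintext, key)


m1 = b"attack at dawn"
m2 = b"retreat at six"   # same length as m1

key, c1 = otp_encrypt(m1)
print("decrypted:", xor_bytes(c1, key))

# Improper reuse of the same key for a second message:
c2 = xor_bytes(m2, key)
leak = xor_bytes(c1, c2)
print("c1 XOR c2 equals m1 XOR m2:", leak == xor_bytes(m1, m2))
```

With a fresh key per message no such relation exists between ciphertexts, which is the property the information-theoretic argument relies on.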
===Pseudorandom number generation===
[[Pseudorandom number generator]]s are widely available in computer language libraries and application programs. They are, almost universally, unsuited to cryptographic use as they do not evade the deterministic nature of modern computer equipment and software. A class of improved random number generators is termed [[cryptographically secure pseudorandom number generator]]s, but even they require [[random seed]]s from outside the software to work as intended. These can be obtained via [[extractor]]s, if done carefully. The measure of sufficient randomness in extractors is [[min-entropy]], a value related to Shannon entropy through [[Rényi entropy]]; Rényi entropy is also used in evaluating randomness in cryptographic systems. Although related, the distinctions among these measures mean that a [[random variable]] with high Shannon entropy is not necessarily satisfactory for use in an extractor, and hence for cryptographic use.
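The gap between Shannon entropy and min-entropy can be seen in a small numerical sketch. The distribution below is a hypothetical example constructed for illustration: one outcome has probability 1/2 and the remaining probability is spread evenly over 2<sup>16</sup> other outcomes. The function names are likewise illustrative.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy in bits: -sum(p * log2(p)) over outcomes with p > 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def min_entropy(probs):
    """Min-entropy in bits: -log2 of the most likely outcome's probability."""
    return -math.log2(max(probs))


# Hypothetical source: one outcome occurs half the time, the other half
# of the probability mass is spread uniformly over 2**16 outcomes.
n = 2 ** 16
probs = [0.5] + [0.5 / n] * n

h_shannon = shannon_entropy(probs)  # about 9 bits
h_min = min_entropy(probs)          # 1 bit
```

Under these assumptions the Shannon entropy is about 9 bits, yet an adversary who always guesses the most likely outcome is right half the time, which is what the min-entropy of 1 bit expresses. A source of this shape would be a poor input to an extractor despite its comparatively high Shannon entropy.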
===Miscellaneous applications===
Information theory also has applications in [[Gambling and information theory|gambling and investing]], [[black hole information paradox|black holes]], [[bioinformatics]], and [[music]].
Italian language
'''Italian''' (''lingua italiana'') is a [[Romance languages|Romance language]] spoken as a [[first language]] by about 63 million people, primarily in [[Italy]]. In [[Switzerland]], Italian is one of four [[Linguistic geography of Switzerland|official language]]s. It is also the official language of [[San Marino]]. It is the primary language of the [[Vatican City]]. Standard Italian, adopted by the state after the [[unification of Italy]], is based on [[Tuscan dialect|Tuscan]] and is somewhat intermediate between [[Italo-Western|Italo-Dalmatian languages]] of the [[Mezzogiorno|South]] and [[Northern Italian dialects]] of the [[Northern Italy|North]].
Unlike most other Romance languages, Italian has retained the contrast between short and [[consonant length|long consonants]] which existed in Latin. As in most [[Romance languages]], [[stress (linguistics)|stress]] is distinctive. Of the Romance languages, Italian is considered to be one of the closest to [[Latin]] in terms of [[vocabulary]]. According to Ethnologue, lexical similarity is 89% with [[French language|French]], 87% with [[Catalan language|Catalan]], 85% with [[Sardinian language|Sardinian]], 82% with [[Spanish language|Spanish]], 78% with Rheto-Romance, and 77% with Romanian.
It is affectionately called ''il parlar gentile'' (the gentle language) by its speakers.
==Writing system==
Italian is written using the [[Latin alphabet]]. The letters ''J'', ''K'', ''W'', ''X'' and ''Y'' are not considered part of the standard [[Italian alphabet]], but appear in loanwords (such as ''jeans'', ''whisky'', ''taxi''). ''X'' has become a commonly used letter in genuine Italian words with the prefix ''extra-''. ''J'' in Italian is an old-fashioned orthographic variant of ''I'', appearing in the first name "Jacopo" as well as in some Italian place names, e.g., the towns of [[Bajardo]], [[Bojano]], [[Joppolo]], [[Jesolo]], [[Jesi]], among numerous others, and in the alternate spelling ''Mar Jonio'' (also spelled ''Mar Ionio'') for the [[Ionian Sea]]. ''J'' may also appear in many words from different dialects, but its use is discouraged in contemporary Italian, and it is not part of the standard 21-letter contemporary Italian alphabet. Each of these foreign letters had an Italian equivalent spelling: ''gi'' for ''j'', ''c'' or ''ch'' for ''k'', ''u'' or ''v'' for ''w'' (depending on what sound it makes), ''s'', ''ss'', or ''cs'' for ''x'', and ''i'' for ''y''.
* Italian uses the [[acute accent]] over the letter ''E'' (as in ''perché'', why/because) to indicate a front mid-close vowel, and the [[grave accent]] (as in ''tè'', tea) to indicate a front mid-open vowel. The [[grave accent]] is also used on letters ''A'', ''I'', ''O'', and ''U'' to mark [[stress (linguistics)|stress]] when it falls on the final vowel of a word (for instance ''gioventù'', youth). Typically, the penultimate syllable is stressed. If syllables other than the last one are stressed, the accent is not mandatory, unlike in [[Spanish language|Spanish]], and, in virtually all cases, it is omitted. When a word is ambiguous (such as ''principi''), the accent mark is sometimes used to disambiguate its meaning (in this case, ''prìncipi'', princes, or ''princìpi'', principles). This is, however, not compulsory. Rare words with three or more syllables can confuse Italians themselves, and the pronunciation of [[Istanbul]] is a common example of a word in which placement of stress is not clearly established. Turkish, like French, tends to stress the final syllable, but Italian does not, so both "Istànbul" and "Ìstanbul" are heard. Another instance is the American state of [[Florida]]: the correct Italian pronunciation follows the Spanish, "Florìda", but because an Italian word with the same meaning ("flourishing") is pronounced "flòrida", and because of the influence of English, most Italians pronounce it that way. Dictionaries give the latter as an alternative pronunciation.
* The letter ''H'' at the beginning of a word is used to distinguish ''ho'', ''hai'', ''ha'', ''hanno'' (present indicative of ''avere'', 'to have') from ''o'' ('or'), ''ai'' ('to the'), ''a'' ('to'), ''anno'' ('year'). In the spoken language this letter is always silent in the cases given above. ''H'' is also used in combinations with other letters (see below), but no [[phoneme]] {{IPA|[h]}} exists in Italian. In foreign words that have entered common use, like "hotel" or "hovercraft", the H is commonly silent, so they are pronounced as {{IPA|/oˈtɛl/}} and {{IPA|/ˈɔverkraft/}}.
* The letter ''Z'' represents {{IPA|/ʣ/}}, for example: ''Zanzara'' {{IPA|/dzan'dzaɾa/}} (mosquito), or {{IPA|/ʦ/}}, for example: ''Nazione'' {{IPA|/naˈttsjone/}} (nation), depending on context, though there are few [[minimal pair]]s. The same goes for ''S'', which can represent {{IPA|/s/}} or {{IPA|/z/}}. However, these two phonemes are in [[complementary distribution]] everywhere except between two vowels in the same word, and even in such environment there are extremely few minimal pairs, so that this distinction is being lost in many varieties.
* The letters ''C'' and ''G'' represent [[affricate]]s: [[Voiceless postalveolar affricate|{{IPA|/ʧ/}}]] as in "chair" and [[Voiced postalveolar affricate|{{IPA|/ʤ/}}]] as in "gem", respectively, before the [[front vowel]]s ''I'' and ''E''. They are pronounced as [[plosive]]s {{IPA|/k/}}, {{IPA|/g/}} (as in "call" and "gall") otherwise. Front/back vowel rules for ''C'' and ''G'' are similar in [[French language|French]], [[Romanian language|Romanian]], [[Spanish language|Spanish]], and to some extent [[English language|English]] (including [[Old English]]). [[swedish language|Swedish]] and [[Norwegian language|Norwegian]] have similar rules for ''K'' and ''G''. (See also [[palatalization]].)
* However, an ''H'' can be added between ''C'' or ''G'' and ''E'' or ''I'' to represent a plosive, and an ''I'' can be added between ''C'' or ''G'' and ''A'', ''O'' or ''U'' to signal that the consonant is an affricate.
:Note that the ''H'' is [[silent letter|silent]] in the digraphs ''[[ch (digraph)|CH]]'' and ''[[gh (digraph)|GH]]''; likewise, the ''I'' in ''cia'', ''cio'', ''ciu'' and even ''cie'' is not pronounced as a separate vowel unless it carries the primary stress. For example, it is silent in ''[[ciao]]'' {{IPA|/ˈʧa.o/}} and ''cielo'' {{IPA|/ˈʧɛ.lo/}}, but it is pronounced in ''farmacia'' {{IPA|/ˌfaɾ.ma.ˈʧi.a/}} and ''farmacie'' {{IPA|/ˌfaɾ.ma.ˈʧi.e/}}.
* There are three other special [[digraph (orthography)|digraphs]] in Italian: ''[[gn (digraph)|GN]]'', ''GL'' and ''SC''. ''GN'' represents [[Palatal nasal|{{IPA|/ɲ/}}]]. ''GL'' represents [[Palatal lateral approximant|{{IPA|/ʎ/}}]] only before ''i'', and never at the beginning of a word, except in the [[personal pronoun]] and [[definite article]] ''gli''. (Compare with [[Spanish language|Spanish]] ''ñ'' and ''ll'', [[Portuguese language|Portuguese]] ''nh'' and ''lh''.) ''SC'' represents fricative [[Voiceless postalveolar fricative|{{IPA|/ʃ/}}]] before ''i'' or ''e''. Except in the speech of some Northern Italians, all of these are normally [[geminate]] between vowels.
* In general, all letters or digraphs represent phonemes rather clearly, and, in standard varieties of Italian, there is little allophonic variation. The most notable exceptions are assimilation of /n/ in point of articulation before consonants, assimilatory voicing of /s/ to following voiced consonants, and vowel length (vowels are long in stressed open syllables, and short elsewhere); compare the enormous number of [[allophone]]s of the English phoneme /t/. Spelling is clearly phonemic and difficult to mistake given a clear pronunciation. Exceptions are generally only found in foreign borrowings. There are fewer cases of [[dyslexia]] than among speakers of languages such as English, and the concept of a spelling bee is strange to Italians.
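The spelling rules for ''C'' and ''G'' given in the list above are regular enough to be written as a short procedure. The following Python sketch is an illustration only: the function name <code>cg_sound</code> is hypothetical, and the sketch deliberately ignores the digraphs ''GN'', ''GL'' and ''SC'', which follow the separate rules described above.

```python
# Written front vowels, including accented forms.
FRONT_VOWELS = set("eiéèíì")


def cg_sound(word: str, index: int) -> str:
    """Classify the C or G at word[index] as 'affricate' or 'plosive'.

    Simplified sketch: only the letter that follows is inspected, and
    the digraphs GN, GL and SC are not handled.
    """
    letter = word[index].lower()
    if letter not in ("c", "g"):
        raise ValueError("expected the letter C or G at the given index")
    # Next written letter, or "" at the end of the word.
    following = word[index + 1:index + 2].lower()
    # Before a written E or I the consonant is an affricate. This also
    # covers the silent I of 'cia', 'cio', 'ciu', which is written only
    # to signal the affricate.
    if following and following in FRONT_VOWELS:
        return "affricate"
    # Before A, O, U, before H (as in 'chi', 'ghiro'), before another
    # consonant, or at the end of a word, the consonant is a plosive.
    return "plosive"


examples = {
    "ciao": cg_sound("ciao", 0),
    "chi": cg_sound("chi", 0),
    "casa": cg_sound("casa", 0),
    "gelato": cg_sound("gelato", 0),
    "ghiro": cg_sound("ghiro", 0),
}
```

Because the inserted ''H'' or ''I'' is itself the following letter, no special case is needed for either: ''chi'' comes out as a plosive because ''H'' is not a front vowel, and ''ciao'' as an affricate because the written ''I'' is.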
==History==
The history of the Italian language is long, but the modern standard of the language was largely shaped by relatively recent events. The earliest surviving texts which can definitely be called Italian (or more accurately, vernacular, as opposed to its predecessor [[Vulgar Latin]]) are legal formulae from the region of [[province of Benevento|Benevento]] dating from 960-963. What would come to be thought of as Italian was first formalized in the first years of the 14th century through the works of [[Dante Alighieri]], who mixed southern Italian languages, especially [[Sicilian language|Sicilian]], with his native Tuscan in his epic poems known collectively as the ''[[Divine Comedy|Commedia]],'' to which [[Giovanni Boccaccio]] later affixed the title ''Divina''. Dante's much-loved works were read throughout Italy and his written dialect became the "canonical standard" that all educated Italians could understand. Dante is still credited with standardizing the Italian language and, thus, the dialect of [[Tuscany]] became the basis for what would become the official language of Italy.
Italy has always had a distinctive dialect for each city, since the cities were until recently thought of as [[city-state]]s. Standard Italian itself now shows considerable [[variety (linguistics)|variety]], however. As Tuscan-derived Italian came to be used throughout the nation, features of local speech were naturally adopted, producing various versions of Regional Italian. The most characteristic differences, for instance, between [[Romanesco|Roman Italian]] and [[Milanese|Milanese Italian]] are the [[consonant length|gemination]] of initial consonants and the pronunciation of stressed "e", and of "s" in some cases (e.g. ''va bene'' "all right" is pronounced {{IPA|[va ˈbːɛne]}} by a Roman, {{IPA|[va ˈbene]}} by a Milanese; ''a casa'' "at home": Roman {{IPA|[a ˈkːasa]}}, Milanese {{IPA|[a ˈkaza]}}).
In contrast to the [[Northern Italian language|dialects of northern Italy]], [[southern Italian]] dialects were largely untouched by the Franco-[[Occitan language|Occitan]] influences introduced to Italy, mainly by [[bard]]s from [[France]], during the [[Middle Ages]]. Even in the case of Northern Italian dialects, however, scholars are careful not to overstate the effects of outsiders on the natural indigenous developments of the languages. (See [[La Spezia-Rimini Line]].)
The economic might and relatively advanced development of [[Tuscany]] at the time ([[Late Middle Ages]]) gave its dialect weight, though Venetian remained widespread in medieval Italian commercial life. Also, the increasing cultural relevance of [[Florence, Italy|Florence]] during the periods of '[[Humanism|Umanesimo (Humanism)]]' and the [[Renaissance|Rinascimento (Renaissance)]] made its ''volgare'' (dialect), or rather a refined version of it, a standard in the arts.
The re-discovery of Dante's ''[[De vulgari eloquentia]]'' and a renewed interest in linguistics in the 16th century sparked a debate which raged throughout Italy concerning which criteria should be chosen to establish a modern Italian standard to be used as both a literary and a spoken language. Scholars were divided into three factions: the [[purism|purists]], headed by [[Pietro Bembo]], who in his ''[[Gli Asolani]]'' claimed that the language might only be based on the great literary classics (notably [[Petrarch]] and Boccaccio, but not Dante, as Bembo believed that the Divine Comedy was not dignified enough because it used elements from other dialects); [[Niccolò Machiavelli]] and other [[Florence|Florentine]]s, who preferred the version spoken by ordinary people in their own times; and the [[courtier]]s, like [[Baldassarre Castiglione]] and [[Gian Giorgio Trissino]], who insisted that each local vernacular must contribute to the new standard. Eventually Bembo's ideas prevailed, the result being the foundation of the [[Accademia della Crusca]] in Florence (1582-3), the official legislative body of the Italian language, and the publication of the first Italian dictionary in 1612.
Italian literature's first modern novel, [[The Betrothed|''I Promessi Sposi'']] (The Betrothed), by [[Alessandro Manzoni]], further defined the standard by "rinsing" his Milanese "in the waters of the [[Arno River|Arno]]" ([[Florence]]'s river), as he states in the Preface to his 1840 edition.
After unification, a huge number of civil servants and soldiers recruited from all over the country introduced many more words and idioms from their home dialects ("[[ciao]]" is [[Venetian language|Venetian]], "[[panettone]]" is [[Milanese]], etc.).
==Classification==
Italian is most closely related to the other two Italo-Dalmatian languages, [[Sicilian language|Sicilian]] and the extinct [[Dalmatian language|Dalmatian]]. The three are part of the [[Italo-Western languages|Italo-Western]] grouping of the [[Romance languages]], which are a subgroup of the [[Italic languages|Italic]] branch of [[Indo-European language family|Indo-European]].
==Geographic distribution==
Between 60 and 70 million people speak Italian as their mother tongue. Those who use Italian as a second or cultural language are estimated at around 110-120 million.
Italian is the official language of [[Italy]] and [[San Marino]], and one of the official languages of [[Switzerland]], spoken mainly in [[Canton Ticino|Ticino]] and [[Graubünden|Grigioni]] cantons, a region referred to as [[Italian Switzerland]]. It is also the second official language in some areas of [[Istria]], in [[Slovenia]] and [[Croatia]], where an Italian minority exists. It is the primary language of the [[Vatican City]] and is widely used and taught in [[Monaco]] and [[Malta]]. It is also widely understood in France with over one million speakers (especially in [[Corsica]] and the [[County of Nice]], areas that historically spoke [[Italian dialects]] before annexation to [[France]]), and in [[Albania]].
Italian is also spoken by some in former Italian colonies in [[Africa]] ([[Libya]], [[Somalia]] and [[Eritrea]]). However, its use has sharply dropped off since the colonial period. In [[Eritrea]], [[Italian Language|Italian]] is widely understood. In fact, for fifty years, during the colonial period, Italian was the language of instruction, but [[as of 1997]], there is only one Italian language school remaining, with 470 pupils. In [[Somalia]], Italian used to be a major language, but due to the civil war and lack of education only the older generation still uses it.
Italian and [[Italian dialects]] are widely used by Italian immigrants and many of their descendants (see ''[[Italians]]'') living throughout [[Western Europe]] (especially [[France]], [[Germany]], [[Belgium]], [[Switzerland]], the [[Britalian|United Kingdom]] and [[Luxembourg]]), the [[Italian Americans|United States]], [[Italian Canadians|Canada]], [[Italian Australians|Australia]], and [[Latin America]] (especially [[Uruguay]], [[Italian Brazilians|Brazil]], [[Argentina]], and [[Venezuela]]).
In the United States, Italian speakers are most commonly found in four cities: [[Boston]] (7,000), [[Chicago]] (12,000), [[New York City]] (140,000), and [[Philadelphia]] (15,000). In Canada there are large Italian-speaking communities in [[Montreal]] (120,000) and [[Toronto]] (195,000). Italian is the second most commonly-spoken language in Australia, where 353,605 [[Italian Australian]]s, or 1.9% of the population, reported speaking Italian at home in the 2001 [[Census in Australia|Census]]. In 2001 there were 130,000 Italian speakers in [[Melbourne]], and 90,000 in [[Sydney]].
===Italian language education===
Italian is widely taught in many schools around the world, but rarely as the first non-native language of pupils; in fact, Italian generally is the fourth or fifth most taught second-language in the world.
In [[anglophone]] parts of [[Canada]], Italian is, after [[French language|French]], the third most taught language. In [[francophone]] Canada it is third after [[English language|English]]. In the [[United States]] and the [[United Kingdom]], Italian ranks fourth (after [[Spanish language|Spanish]]-French-[[German language|German]] and French-German-Spanish respectively). Throughout the world, Italian is the fifth most taught non-native language, after [[English language|English]], French, Spanish, and German.
In the [[European Union]], Italian is spoken as a mother tongue by 13% of the population (64 million, mainly in Italy itself) and as a second language by 3% (14 million); among EU member states, it is most likely to be desired (and therefore learned) as a second language in [[Malta]] (61%), [[Croatia]] (14%), [[Slovenia]] (12%), [[Austria]] (11%), [[Romania]] (8%), [[France]] (6%), and [[Greece]] (6%). It is also an important second language in [[Albania]] and [[Switzerland]], which are not EU members or candidates.
===Influence and derived languages===
From the late 19th to the mid 20th century, thousands of Italians settled in Argentina, Uruguay and southern Brazil, where they formed a very strong physical and cultural presence (see the [[Italian diaspora]]).
In some cases, colonies were established where variants of [[Italian dialects]] were used, and some continue to use a derived dialect. Examples are [[Rio Grande do Sul]], [[Brazil]], where [[Talian]] is used, and the town of [[Chipilo]] near Puebla, [[Mexico]]; each continues to use a derived form of [[Venetian language|Venetian]] dating back to the 19th century. Another example is [[Cocoliche]], an Italian-Spanish [[pidgin]] once spoken in [[Argentina]] and especially in [[Buenos Aires]], and [[Lunfardo]]. [[Rioplatense Spanish]], and particularly the speech of the city of Buenos Aires, has intonation patterns that resemble those of Italian dialects, because Argentina has had a constant, large influx of Italian settlers since the second half of the nineteenth century: initially primarily from Northern Italy, then, since the beginning of the twentieth century, mostly from Southern Italy.
===Lingua Franca===
Starting in late [[medieval]] times, Italian language variants (especially the Tuscan and Venetian variants) replaced Latin to become the primary commercial language for much of Europe and the Mediterranean. This became solidified during the [[Renaissance]] with the strength of Italian banking and the rise of [[Renaissance humanism|humanism]] in the arts.
During the period of the Renaissance, Italy held artistic sway over the rest of Europe. All educated European gentlemen were expected to make the [[Grand Tour]], visiting Italy to see its great historical monuments and works of art. It thus became expected that educated Europeans would learn at least some Italian; the English poet [[John Milton]], for instance, wrote some of his early poetry in Italian. In England, Italian became the second most common modern language to be learned, after [[French language|French]] (though the classical languages, [[Latin]] and [[Greek language|Greek]], came first). However, by the late eighteenth century, Italian tended to be replaced by [[German language|German]] as the second modern language on the curriculum. Yet Italian [[loanword]]s continue to be used in most other [[European languages]] in matters of art and music.
Today, the Italian language continues to be used as a [[lingua franca]] in some environments. Within the [[Catholic church]], Italian is known by a large part of the ecclesiastic hierarchy, and is used in place of [[Latin]] in some official documents. The presence of Italian as the primary language in the [[Vatican City]] indicates not only use within the [[Holy See]], but also throughout the world where an episcopal seat is present. It continues to be used in [[music]] and [[opera]]. Italian is also sometimes used as a means of communication in some sports (sometimes in [[Football (association)|football]] and [[motorsports]]) and in the [[design]] and [[fashion]] industries.
==Dialects==
In Italy, all [[Romance languages]] spoken as the vernacular, other than standard Italian and other unrelated, non-Italian languages, are termed "Italian dialects". Many Italian dialects are, in fact, historical languages in their own right. These include recognized language groups such as [[Friulian language|Friulian]], [[Neapolitan language|Neapolitan]], [[Sardinian language|Sardinian]], [[Sicilian language|Sicilian]], [[Venetian language|Venetian]], and others, and regional variants of these languages such as [[Calabrian languages|Calabrian]]. The division between dialect and language has been used by scholars (such as by [[Francesco Bruni]]) to distinguish between the languages that made up the Italian [[koine]], and those which had very little or no part in it, such as [[Albanian language|Albanian]], [[Greek language|Greek]], [[German language|German]], [[Ladin language|Ladin]], and [[Occitan language|Occitan]], which are still spoken by minorities.
Dialects are generally not used for general mass communication and are usually limited to native speakers in informal contexts. In the past, speaking in dialect was often deprecated as a sign of poor education. Younger generations, especially those under 35 (though it may vary in different areas), speak almost exclusively standard Italian in all situations, usually with local accents and idioms. Regional differences can be recognized by various factors: the openness of vowels, the length of the consonants, and influence of the local dialect (for example, ''annà'' replaces ''andare'' in the area of Rome for the infinitive "to go").
==Sounds==
{{IPA notice|lang=it}}
===Vowels===
Italian has seven [[vowel]] phonemes: {{IPA|/a/}}, {{IPA|/e/}}, {{IPA|/ɛ/}}, {{IPA|/i/}}, {{IPA|/o/}}, {{IPA|/ɔ/}}, {{IPA|/u/}}. The pairs {{IPA|/e/}}-{{IPA|/ɛ/}} and {{IPA|/o/}}-{{IPA|/ɔ/}} are seldom distinguished in writing and often confused, even though most varieties of Italian employ both phonemes consistently. Compare, for example: "perché" {{IPA|[perˈkɛ]}} (why, because) and "senti" {{IPA|[ˈsenti]}} (you listen, you are listening, listen!), employed by some northern speakers, with {{IPA|[perˈke]}} and {{IPA|[ˈsɛnti]}}, as pronounced by most central and southern speakers. As a result, the usage is strongly indicative of a person's origin. The standard (Tuscan) usage of these vowels is listed in vocabularies, and employed outside Tuscany mainly by specialists, especially actors and very few (television) journalists.
These are truly different [[phonemes]], however: compare {{IPA|/ˈpeska/}} (fishing) and {{IPA|/ˈpɛska/}} (peach), both spelled ''pesca''. Similarly {{IPA|/ˈbotte/}} ('barrel') and {{IPA|/ˈbɔtte/}} ('beatings'), both spelled ''botte'', discriminate {{IPA|/o/}} and {{IPA|/ɔ/}}.
In general, each vowel in a vowel combination is pronounced separately. [[Diphthong]]s exist (e.g. ''uo'', ''iu'', ''ie'', ''ai''), but are limited to an unstressed ''u'' or ''i'' before or after a stressed vowel.
The unstressed ''u'' in a diphthong approximates the English semivowel ''w'', the unstressed ''i'' approximates the semivowel ''y''. E.g.: ''buono'' {{IPA|[ˈbwɔno]}}, ''ieri'' {{IPA|[ˈjɛri]}}. [[Triphthong]]s exist in Italian as well, like "contin''uia''mo" ("we continue"). Three vowel combinations exist only in the form semiconsonant ({{IPA|/j/}} or {{IPA|/w/}}), followed by a vowel, followed by a desinence vowel (usually {{IPA|/i/}}), as in ''miei'', ''suoi'', or two semiconsonants followed by a vowel, as the group ''-uia-'' exemplified above, or ''-iuo-'' in the word ''aiuola''.
===Mobile diphthongs===
Many Latin words with a short ''e'' or ''o'' have Italian counterparts with a mobile diphthong (''ie'' and ''uo'' respectively). When the vowel sound is stressed, it is pronounced and written as a diphthong; when not stressed, it is pronounced and written as a single vowel.
So Latin ''focus'' gave rise to Italian ''fuoco'' (meaning both "fire" and "optical focus"): when unstressed, as in ''focale'' ("focal") the "o" remains alone. Latin ''pes'' (more precisely its accusative form ''pedem'') is the source of Italian ''piede'' (foot): but unstressed "e" was left unchanged in ''pedone'' (pedestrian) and ''pedale'' (pedal). From Latin ''iocus'' comes Italian ''giuoco'' ("play", "game"), though in this case ''gioco'' is more common: ''giocare'' means "to play (a game)". From Latin ''homo'' comes Italian ''uomo'' (man), but also ''umano'' (human) and ''ominide'' (hominid). From Latin ''ovum'' comes Italian ''uovo'' (egg) and ''ovaie'' (ovaries). (The same phenomenon occurs in [[Spanish language|Spanish]]: ''juego'' (play, game) and ''jugar'' (to play), ''nieve'' (snow) and ''nevar'' (to snow)).
===Consonants===
Two symbols in a table cell denote the voiceless and voiced consonant, respectively.
Nasals undergo assimilation when followed by a consonant, e.g., when preceding a velar ({{IPA|/k/}} or {{IPA|/g/}}) only {{IPA|[ŋ]}} appears, etc.
Italian has geminate, or double, consonants, which are distinguished by [[Consonant length|length]]. Length is distinctive for all consonants except for {{IPA|/ʃ/}}, {{IPA|/ʦ/}}, {{IPA|/ʣ/}}, {{IPA|/ʎ/}} and {{IPA|/ɲ/}}, which are always geminate, and {{IPA|/z/}}, which is always single.
Geminate plosives and affricates are realized as lengthened closures. Geminate fricatives, nasals, and {{IPA|/l/}} are realized as lengthened [[continuant]]s. The flap consonant {{IPA|/ɾː/}} is typically dialectal, and it is called ''erre moscia''. The correct standard pronunciation is {{IPA|[r]}}.
Of special interest to the linguistic study of Italian is the ''[[Tuscan gorgia|Gorgia Toscana]]'', or "Tuscan Throat", the weakening or [[lenition]] of certain [[:wiktionary:intervocalic|intervocalic]] consonants in [[Tuscan dialect]]s. See also [[Syntactic doubling]].
===Assimilation===
Italian has few diphthongs, so most unfamiliar diphthongs that are heard in foreign words (in particular, those beginning with vowel "a", "e", or "o") will be assimilated as the corresponding [[diaeresis]] (i.e., the vowel sounds will be pronounced separately). Italian [[phonotactics]] do not usually permit polysyllabic nouns and verbs to end with consonants, excepting poetry and song, so foreign words may receive extra terminal vowel sounds.
==Grammar==
===Common variations in the writing systems===
Some variations in the usage of the writing system may be present in practical use. These are scorned by educated people, but they are so common in certain contexts that knowledge of them may be useful.
* Usage of ''x'' instead of ''per'': this is very common among teenagers and in [[Text messaging|SMS]] abbreviations. The multiplication operator is pronounced "per" in Italian, and so it is sometimes used to replace the word "per", which means "for"; thus, for example, "per te" ("for you") is shortened to "x te" (compare with English "4 U"). Words containing ''per'' can also have it replaced with ''x'': for example, ''perché'' (both "why" and "because") is often shortened as ''xché'' or ''xké'' or ''x' ''(see below). This usage might be useful to jot down quick notes or to fit more text into the low character limit of an SMS, but it is considered unacceptable in formal writing.
* Usage of foreign letters such as ''k'', ''j'' and ''y'', especially in nicknames and SMS language: ''ke'' instead of ''che'', ''Giusy'' instead of ''Giuseppina'' (or sometimes ''Giuseppe''). This is curiously mirrored in the usage of ''i'' in English names such as ''Staci'' instead of ''Stacey'', or in the usage of ''c'' in [[Northern Europe]] (''Jacob'' instead of ''Jakob''). The use of "k" instead of "ch" or "c" to represent a plosive sound is documented in some historical texts from before the standardization of the Italian language; however, that usage is no longer standard in Italian. Possibly because it is associated with the [[German language]], the letter "k" has sometimes also been used in satire to suggest that a political figure is an authoritarian or even a "pseudo-nazi": [[Francesco Cossiga]] was famously nicknamed ''Kossiga'' by rioting students during his tenure as minister of internal affairs. [Cf. the [[alternative political spelling#"K" replacing "C"|politicized spelling ''Amerika'']] in the USA.]
* Usage of the following abbreviations is limited to the electronic communications media and is deprecated in all other cases: '''nn''' instead of ''non'' (not), '''cmq''' instead of ''comunque'' (anyway, however), '''cm''' instead of ''come'' (how, like, as), '''d''' instead of ''di'' (of), '''(io/loro) sn''' instead of ''(io/loro) sono'' (I am/they are), '''(io) dv''' instead of ''(io) devo'' (I must/I have to) or instead of ''dove'' (where), '''(tu) 6''' instead of ''(tu) sei'' (you are).
* Inexperienced typists often replace accents with apostrophes, such as in ''perche''' instead of ''perché''. Uppercase ''[[È]]'' is particularly rare, as it is absent from the [[Keyboard layout#Italian|Italian keyboard layout]], and is very often written as ''E''' (even though there are [[:it:Aiuto:Manuale di stile#Scrivere .C3.88|several ways]] of producing the uppercase È on a computer). This never happens in books or other professionally typeset material.
==Samples==
==Examples==
*Cheers: ''salute!''
*English: ''inglese'' {{IPA|/iŋˈglese/}}
*Good-bye: ''arrivederci'' {{IPA|/arriveˈdertʃi/}}
*Hello: ''[[ciao]]'' {{IPA|/ˈtʃao/}}
*Good day: ''buon giorno'' {{IPA|/bwɔnˈdʒorno/}}
*Good evening: ''buona sera'' {{IPA|/bwɔnaˈsera/}}
*Yes: ''sì'' {{IPA|/si/}}
*No: ''no'' {{IPA|/nɔ/}}
*How are you?: ''come stai'' {{IPA|/ˈkome ˈstai/}} (informal); ''come sta'' {{IPA|/ˈkome ˈsta/}} (formal)
*Sorry: ''mi dispiace'' {{IPA|/mi disˈpjatʃe/}}
*Excuse me: ''scusa'' {{IPA|/ˈskuza/}} (informal); ''scusi'' {{IPA|/ˈskuzi/}} (formal)
*Again: ''di nuovo'' {{IPA|/di ˈnwɔvo/}}; ''ancora'' {{IPA|/aŋˈkora/}}
*Always: ''sempre'' {{IPA|/ˈsɛmpre/}}
*When: ''quando'' {{IPA|/ˈkwando/}}
*Where: ''dove'' {{IPA|/ˈdove/}}
*Why/Because: ''perché'' {{IPA|/perˈke/}}
*How: ''come'' {{IPA|/ˈkome/}}
*How much is it?: ''quanto costa?'' {{IPA|/ˈkwanto ˈkɔsta/}}
*Thank you!: ''grazie!'' {{IPA|/ˈgrattsje/}}
*Bon appetit: ''buon appetito'' {{IPA|/ˌbwɔn appeˈtito/}}
*You're welcome!: ''prego!'' {{IPA|/ˈprɛgo/}}
*I love you: ''ti amo'' {{IPA|/ti ˈamo/}}, ''ti voglio bene'' {{IPA|/ti ˈvɔʎʎo ˈbɛne/}}. The difference is that ''ti amo'' is used in a romantic relationship, while ''ti voglio bene'' is used on any other occasion (to parents, relatives, friends and so on).
Counting to twenty:
*One: ''uno'' {{IPA|/ˈuno/}}
*Two: ''due'' {{IPA|/ˈdue/}}
*Three: ''tre'' {{IPA|/tre/}}
*Four: ''quattro'' {{IPA|/ˈkwattro/}}
*Five: ''cinque'' {{IPA|/ˈʧiŋkwe/}}
*Six: ''sei'' {{IPA|/ˈsɛi/}}
*Seven: ''sette'' {{IPA|/ˈsɛtte/}}
*Eight: ''otto'' {{IPA|/ˈɔtto/}}
*Nine: ''nove'' {{IPA|/ˈnɔve/}}
*Ten: ''dieci'' {{IPA|/ˈdjɛʧi/}}
*Eleven: ''undici'' {{IPA|/ˈundiʧi/}}
*Twelve: ''dodici'' {{IPA|/ˈdodiʧi/}}
*Thirteen: ''tredici'' {{IPA|/ˈtrediʧi/}}
*Fourteen: ''quattordici'' {{IPA|/kwatˈtordiʧi/}}
*Fifteen: ''quindici'' {{IPA|/ˈkwindiʧi/}}
*Sixteen: ''sedici'' {{IPA|/ˈsediʧi/}}
*Seventeen: ''diciassette'' {{IPA|/diʧasˈsɛtte/}}
*Eighteen: ''diciotto'' {{IPA|/diˈʧɔtto/}}
*Nineteen: ''diciannove'' {{IPA|/diʧanˈnɔve/}}
*Twenty: ''venti'' {{IPA|/ˈventi/}}
The days of the week:
*Monday: ''lunedì'' {{IPA|/luneˈdi/}}
*Tuesday: ''martedì'' {{IPA|/marteˈdi/}}
*Wednesday: ''mercoledì'' {{IPA|/merkoleˈdi/}}
*Thursday: ''giovedì'' {{IPA|/dʒoveˈdi/}}
*Friday: ''venerdì'' {{IPA|/venerˈdi/}}
*Saturday: ''sabato'' {{IPA|/ˈsabato/}}
*Sunday: ''domenica'' {{IPA|/doˈmenika/}}
===Sample texts===
There is a recording of [[Dante]]'s [[Divine Comedy]] read by [[Lino Pertile]] available at http://etcweb.princeton.edu/dante/pdp/
Japanese language
{{Nihongo|'''Japanese'''|日本語 / にほんご |3=}} is a language spoken by over 130 million people in [[Japan]] and in Japanese emigrant communities. It is related to the [[Ryukyuan languages]], but whatever [[Classification of the Japanese language|relationships with other languages]] it may have remain undemonstrated. It is an [[agglutinative language]] and is distinguished by a complex system of [[Honorific speech in Japanese|honorifics]] reflecting the hierarchical nature of Japanese society, with verb forms and particular vocabulary to indicate the relative status of the speaker, the listener and any third person mentioned in conversation, whether present or not. The sound inventory of Japanese is relatively small, and it has a lexically distinctive [[Japanese pitch accent|pitch-accent]] system. It is a [[mora (linguistics)|mora]]-timed language.
The Japanese language is written with a combination of three different types of scripts: [[Chinese characters]] called ''[[kanji]]'' (漢字 / かんじ), and two [[syllabary|syllabic]] scripts made up of modified [[Chinese characters]], ''[[hiragana]]'' (平仮名 / ひらがな) and ''[[katakana]]'' (片仮名 / カタカナ). The [[Latin alphabet]], ''[[rōmaji]]'' (ローマ字), is also often used in modern Japanese, especially for company names and logos, advertising, and when entering Japanese text into a computer. Western style [[Arabic numerals]] are generally used for numbers, but traditional [[Sino-Japanese vocabulary|Sino-Japanese]] numerals are also commonplace.
Japanese [[vocabulary]] has been heavily influenced by [[loanword]]s from other languages. A vast number of words were borrowed from [[Chinese language|Chinese]], or created from Chinese models, over a period of at least 1,500 years. Since the late 19th century, Japanese has borrowed a considerable number of words from [[Indo-European languages]], primarily [[English language|English]]. Because of the special trade relationship between Japan and first [[Portugal]] in the 16th century, and then mainly the [[Netherlands]] in the 17th century, [[Portuguese language|Portuguese]], [[German language|German]] and [[Dutch language|Dutch]] have also been influential.
== Geographic distribution ==
Although Japanese is spoken almost exclusively in Japan, it has been and sometimes still is spoken elsewhere. When [[Imperial Japan|Japan]] occupied [[Korea]], [[Taiwan]], parts of the [[Chinese mainland]], and various Pacific islands before and during [[World War II]], locals in [[Greater East Asia Co-Prosperity Sphere|those countries]] were forced to learn Japanese in empire-building programs. As a result, there are many people in these countries who can speak Japanese in addition to the local languages. Japanese emigrant communities (the largest of which are to be found in [[Brazil]]) sometimes employ Japanese as their primary language. Approximately 5% of Hawaii residents speak Japanese, with Japanese ancestry the largest single ancestry in the state (over 24% of the population). Japanese emigrants can also be found in [[Peru]], [[Argentina]], [[Australia]] (especially [[Sydney]], [[Brisbane]], and [[Melbourne]]), the [[United States]] (notably [[California]], where 1.2% of the population has Japanese ancestry, and [[Hawaii]]), and the [[Philippines]] (particularly in [[Davao]] and [[Laguna (province)|Laguna]]). Their descendants, who are known as {{transl|ja|''[[nikkei]]''}} ({{lang|ja|日系}}, literally Japanese descendants), however, rarely speak Japanese fluently after the second generation. There are estimated to be several million non-Japanese studying the language as well.
=== Official status ===
Japanese is the de facto official language of Japan. There is a form of the language considered standard: {{nihongo|''hyōjungo''|標準語|}}, "Standard Japanese", or {{nihongo|''kyōtsūgo''|共通語|}}, "the common language". The meanings of the two terms are almost the same. {{transl|ja|''Hyōjungo''}} or {{transl|ja|''kyōtsūgo''}} is a concept defined in contrast to dialect. This normative language arose after the {{nihongo|[[Meiji Restoration]]|明治維新|meiji ishin|1868}} from the language spoken in uptown [[Tokyo]], out of the need for a common means of communication. {{transl|ja|''Hyōjungo''}} is taught in schools and used on television and in official communications, and is the version of Japanese discussed in this article.
Formerly, standard {{nihongo|Japanese in writing|文語|[[Bungo (Japanese language)|bungo]]|"literary language"}} was different from {{nihongo|colloquial language|口語|[[Kogo (Japanese language)|kōgo]]}}. The two systems have different rules of grammar and some variance in vocabulary. {{transl|ja|''Bungo''}} was the main method of writing Japanese until about 1900; since then {{transl|ja|''kōgo''}} gradually extended its influence and the two methods were both used in writing until the 1940s. {{transl|ja|''Bungo''}} still has some relevance for historians, literary scholars, and lawyers (many Japanese laws that survived [[World War II]] are still written in {{transl|ja|''bungo''}}, although there are ongoing efforts to modernize their language). {{transl|ja|''Kōgo''}} is the predominant method of both speaking and writing Japanese today, although {{transl|ja|''bungo''}} grammar and vocabulary are occasionally used in modern Japanese for effect.
=== Dialects ===
Dozens of dialects are spoken in Japan. The profusion is due to many factors, including the length of time the [[Japanese Archipelago|archipelago]] has been inhabited, its mountainous island terrain, and Japan's long history of both external and internal isolation. Dialects typically differ in terms of [[Japanese pitch accent|pitch accent]], inflectional [[morphology (linguistics)|morphology]], [[vocabulary]], and particle usage. Some even differ in [[vowel]] and [[consonant]] inventories, although this is uncommon.
The main distinction in Japanese accents is between {{nihongo|Tokyo-type|東京式|Tōkyō-shiki}} and {{nihongo|Kyoto-Osaka-type|京阪式|Keihan-shiki}}, though Kyūshū-type dialects form a third, smaller group. Within each type are several subdivisions. Kyoto-Osaka-type dialects are in the central region, with borders roughly formed by [[Toyama Prefecture|Toyama]], [[Kyoto Prefecture|Kyōto]], [[Hyōgo Prefecture|Hyōgo]], and [[Mie Prefecture|Mie]] Prefectures; most [[Shikoku]] dialects are also of that type. The final category consists of dialects descended from the Eastern dialect of [[Old Japanese]]; these are spoken on [[Hachijōjima|Hachijō-jima island]] and a few other islands.
Dialects from peripheral regions, such as [[Tōhoku Region|Tōhoku]] or [[Tsushima Island|Tsushima]], may be unintelligible to speakers from other parts of the country. The several dialects of [[Kagoshima Prefecture|Kagoshima]] in southern [[Kyūshū]] are famous for being unintelligible not only to speakers of standard Japanese but to speakers of nearby dialects elsewhere in Kyūshū as well. This is probably due in part to the Kagoshima dialects' peculiarities of pronunciation, which include the existence of closed syllables (i.e., syllables that end in a consonant, such as {{IPA|/kob/}} or {{IPA|/koʔ/}} for Standard Japanese {{IPA|/kumo/}} "spider"). The dialect group of [[Kansai region|Kansai]] is spoken or understood by many Japanese, and the [[Osaka]] dialect in particular is associated with comedy (see [[Kansai dialect]]). The dialects of Tōhoku and North [[Kantō region|Kantō]] are stereotypically associated with farmers.
The [[Ryūkyūan languages]], spoken in [[Okinawa Prefecture|Okinawa]] and in the [[Amami Islands]] (which are politically part of [[Kagoshima Prefecture|Kagoshima]]), are distinct enough to be considered a separate branch of the [[Japonic languages|Japonic]] family. However, many ordinary Japanese tend to consider the Ryūkyūan languages to be dialects of Japanese. Not only is each language unintelligible to Japanese speakers, but most are unintelligible to those who speak other Ryūkyūan languages.
Recently, Standard Japanese has become prevalent nationwide (including the Ryūkyū islands) due to [[education]], [[mass media]], and increased mobility within Japan, as well as economic integration.
== Sounds ==
{{IPA notice}}
Japanese vowels are "pure" sounds. The only unusual vowel is the high back vowel {{IPA|/ɯ/}}, which is like {{IPA|/u/}}, but [[roundedness|compressed]] instead of rounded. Japanese has five vowels, and [[vowel length]] is phonemic, so each one has both a short and a long version.
Some Japanese consonants have several [[allophone]]s, which may give the impression of a larger inventory of sounds. However, some of these allophones have since become phonemic. For example, in the Japanese language up to and including the first half of the twentieth century, the phonemic sequence {{IPA|/ti/}} was [[palatalization|palatalized]] and realized phonetically as {{IPA|[tɕi]}}, approximately ''chi''; however, now {{IPA|/ti/}} and {{IPA|/tɕi/}} are distinct, as evidenced by words like ''tī'' {{IPA|[tiː]}} "Western style tea" and ''chii'' {{IPA|[tɕii]}} "social status."
The 'r' of the Japanese language (technically a [[lateral consonant|lateral]] [[apical consonant|apical]] postalveolar flap) is of particular interest: to most English speakers it sounds like something between an 'l' and a [[retroflex consonant|retroflex]] 'r', depending on its position in a word.
The syllabic structure and the [[phonotactics]] are very simple: the only [[consonant cluster]]s allowed within a syllable consist of one of a subset of the consonants plus {{IPA|/j/}}. This type of cluster occurs only in onsets. However, consonant clusters across syllables are allowed as long as the two consonants are a nasal followed by a [[homo-organic]] consonant. [[Consonant length]] (gemination) is also phonemic.
== Grammar ==
=== Sentence structure ===
Japanese word order is classified as [[Subject Object Verb]]. However, unlike many [[Indo-European language]]s, Japanese requires only that the verb come last for a sentence to be intelligible. This is because the Japanese [[sentence element]]s are marked with [[Japanese particles|particles]] that identify their grammatical functions.
The basic sentence structure is [[topic-comment]]. For example, {{transl|ja|''Kochira-wa Tanaka-san desu''}} ({{lang|ja|こちらは田中さんです}}). {{transl|ja|''Kochira''}} ("this") is the topic of the sentence, indicated by the particle ''-wa''. The verb is {{transl|ja|''desu''}}, a [[copula]], commonly translated as "to be" or "it is" (though there are other verbs that can be translated as "to be"). As a phrase, {{transl|ja|''Tanaka-san desu''}} is the comment. This sentence loosely translates to "As for this person, (it) is Mr./Mrs./Miss Tanaka." Thus Japanese, like [[Chinese language|Chinese]], [[Korean language|Korean]], and many other Asian languages, is often called a [[topic-prominent language]], which means it has a strong tendency to indicate the topic separately from the subject, and the two do not always coincide. The sentence {{transl|ja|''Zō-wa hana-ga nagai (desu)''}} ({{lang|ja|象は鼻が長いです}}) literally means, "As for elephants, (their) noses are long". The topic is {{transl|ja|''zō''}} "elephant", and the subject is {{transl|ja|''hana''}} "nose".
Japanese is a [[pro-drop language]], meaning that the subject or object of a sentence need not be stated if it is obvious from context. In addition, it is commonly felt, particularly in spoken Japanese, that the shorter a sentence is, the better. As a result of this grammatical permissiveness and tendency towards brevity, Japanese speakers tend naturally to omit words from sentences, rather than refer to them with [[pronoun]]s. In the context of the above example, {{transl|ja|''hana-ga nagai''}} would mean "[their] noses are long," while {{transl|ja|''nagai''}} by itself would mean "[they] are long." A single verb can be a complete sentence: {{transl|ja|''Yatta!''}} "[I / we / they / etc] did [it]!". In addition, since adjectives can form the predicate in a Japanese sentence (below), a single adjective can be a complete sentence: {{transl|ja|''Urayamashii!''}} "[I'm] jealous [of it]!".
While the language has some words that are typically translated as pronouns, these are not used as frequently as pronouns in some [[Indo-European language]]s, and function differently. Instead, Japanese typically relies on special verb forms and auxiliary verbs to indicate the direction of benefit of an action: "down" to indicate the out-group gives a benefit to the in-group; and "up" to indicate the in-group gives a benefit to the out-group. Here, the in-group includes the speaker and the out-group does not, and their boundary depends on context. For example, {{transl|ja|''oshiete moratta''}} (literally, "explained" with a benefit from the out-group to the in-group) means "[he/she/they] explained it to [me/us]". Similarly, {{transl|ja|''oshiete ageta''}} (literally, "explained" with a benefit from the in-group to the out-group) means "[I/we] explained [it] to [him/her/them]". Such beneficiary auxiliary verbs thus serve a function comparable to that of pronouns and prepositions in Indo-European languages to indicate the actor and the recipient of an action.
Japanese "pronouns" also function differently from most modern Indo-European pronouns (and more like nouns) in that they can take modifiers as any other noun may. For instance, one cannot say in English:
: *The amazed he ran down the street. (grammatically incorrect)
But one ''can'' grammatically say essentially the same thing in Japanese:
: {{transl|ja|''Odoroita kare-wa michi-o hashitte itta.''}} (grammatically correct)
This is partly due to the fact that these words evolved from regular nouns, such as {{transl|ja|''kimi''}} "you" ({{lang|ja|君}} "lord"), {{transl|ja|''anata''}} "you" ({{lang|ja|あなた}} "that side, yonder"), and {{transl|ja|''boku''}} "I" ({{lang|ja|僕}} "servant"). This is why some linguists do not classify Japanese "pronouns" as pronouns, but rather as referential nouns. Japanese personal pronouns are generally used only in situations requiring special emphasis as to who is doing what to whom.
The choice of words used as pronouns is correlated with the sex of the speaker and the social situation in which they are spoken: men and women alike in a formal situation generally refer to themselves as {{transl|ja|''watashi''}} ({{lang|ja|私}} "private") or {{transl|ja|''watakushi''}} (also {{lang|ja|私}}), while men in rougher or intimate conversation are much more likely to use the word {{transl|ja|''ore''}} ({{lang|ja|俺}} "oneself", "myself") or {{transl|ja|''boku''}}. Similarly, different words such as {{transl|ja|''anata''}}, {{transl|ja|''kimi''}}, and {{transl|ja|''omae''}} ({{lang|ja|お前}}, more formally {{lang|ja|御前}} "the one before me") may be used to refer to a listener depending on the listener's relative social position and the degree of familiarity between the speaker and the listener. When used in different social relationships, the same word may have positive (intimate or respectful) or negative (distant or disrespectful) connotations.
Japanese speakers often use the title of the person referred to where a pronoun would be used in English. For example, when speaking to one's teacher, it is appropriate to use {{transl|ja|''sensei''}} ({{lang|ja|先生}}, teacher), but inappropriate to use {{transl|ja|''anata''}}. This is because {{transl|ja|''anata''}} is used to refer to people of equal or lower status, and one's teacher is regarded as having higher status.
For English-speaking learners of Japanese, a frequent beginner's mistake is to include {{transl|ja|''watashi-wa''}} or {{transl|ja|''anata-wa''}} at the beginning of sentences as one would with ''I'' or ''you'' in English. Though such sentences are not grammatically incorrect, they would be considered unnatural even in formal settings; the effect is comparable to repeatedly using a noun in English where a [[pronoun]] would suffice.
=== Inflection and conjugation ===
Japanese nouns have no grammatical number or gender, and take no articles. The noun {{transl|ja|''hon''}} ({{lang|ja|本}}) may refer to a single book or several books; {{transl|ja|''hito''}} ({{lang|ja|人}}) can mean "person" or "people"; and {{transl|ja|''ki''}} ({{lang|ja|木}}) can be "tree" or "trees". Where number is important, it can be indicated by providing a quantity (often with a [[Japanese counter word|counter word]]) or (rarely) by adding a suffix. Words for people are usually understood as singular. Thus {{transl|ja|''Tanaka-san''}} usually means ''Mr./Mrs./Miss Tanaka''. Words that refer to people and animals can be made to indicate a group of individuals through the addition of a collective suffix (a noun suffix that indicates a group), such as {{transl|ja|''-tachi''}}, but this is not a true plural: the meaning is closer to the English phrase "and company". A group described as {{transl|ja|''Tanaka-san-tachi''}} may include people not named Tanaka. Some Japanese nouns are effectively plural, such as {{transl|ja|''hitobito''}} "people" and {{transl|ja|''wareware''}} "we/us", while the word {{transl|ja|''tomodachi''}} "friend" is considered singular, although plural in form.
Verbs are [[Japanese verb conjugations|conjugated]] to show tenses, of which there are two: past and present, or non-past, which is used for the present and the future. For verbs that represent an ongoing process, the {{transl|ja|''-te iru''}} form indicates a continuous (or progressive) tense. For others that represent a change of state, the {{transl|ja|''-te iru''}} form indicates a perfect tense. For example, {{transl|ja|''kite iru''}} means "He has come (and is still here)", but {{transl|ja|''tabete iru''}} means "He is eating".
Questions (both with an interrogative pronoun and yes/no questions) have the same structure as affirmative sentences, but with intonation rising at the end. In the formal register, the question particle {{transl|ja|''-ka''}} is added. For example, {{transl|ja|''Ii desu''}} ({{lang|ja|いいです。}}) "It is OK" becomes {{transl|ja|''Ii desu-ka''}} ({{lang|ja|いいですか?}}) "Is it OK?". In a more informal tone sometimes the particle {{transl|ja|''-no''}} ({{lang|ja|の}}) is added instead to show a personal interest of the speaker: {{transl|ja|''Dōshite konai-no?''}} "Why aren't (you) coming?". Some simple queries are formed simply by mentioning the topic with an interrogative intonation to call for the hearer's attention: {{transl|ja|''Kore-wa?''}} "(What about) this?"; {{transl|ja|''Namae-wa?''}} ({{lang|ja|名前は?}}) "(What's your) name?".
Negatives are formed by inflecting the verb. For example, {{transl|ja|''Pan-o taberu''}} ({{lang|ja|パンを食べる。}}) "I will eat bread" or "I eat bread" becomes {{transl|ja|''Pan-o tabenai''}} ({{lang|ja|パンを食べない。}}) "I will not eat bread" or "I do not eat bread".
The so-called {{transl|ja|''-te''}} verb form is used for a variety of purposes: either progressive or perfect aspect (see above); combining verbs in a temporal sequence ({{transl|ja|''Asagohan-o tabete sugu dekakeru''}} "I'll eat breakfast and leave at once"), simple commands, conditional statements and permissions ({{transl|ja|''Dekakete-mo ii?''}} "May I go out?"), etc.
The word {{transl|ja|''da''}} (plain), {{transl|ja|''desu''}} (polite) is the [[copula]] verb. It corresponds approximately to the English ''be'', but often takes on other roles, including a marker for tense, when the verb is conjugated into its past form {{transl|ja|''datta''}} (plain), {{transl|ja|''deshita''}} (polite). This comes into use because only {{transl|ja|''keiyōshi''}} adjectives and verbs can carry tense in Japanese. Two additional common verbs are used to indicate existence ("there is") or, in some contexts, property: {{transl|ja|''aru''}} (negative {{transl|ja|''nai''}}) and {{transl|ja|''iru''}} (negative {{transl|ja|''inai''}}), for inanimate and animate things, respectively. For example, {{transl|ja|''Neko ga iru''}} "There's a cat", {{transl|ja|''Ii kangae-ga nai''}} "[I] haven't got a good idea". Note that the negative forms of the verbs {{transl|ja|''iru''}} and {{transl|ja|''aru''}} are actually ''i''-adjectives and inflect as such, e.g. {{transl|ja|''Neko ga inakatta''}} "There was no cat".
The verb "to do" ({{transl|ja|''suru''}}, polite form {{transl|ja|''shimasu''}}) is often used to make verbs from nouns ({{transl|ja|''ryōri suru''}} "to cook", {{transl|ja|''benkyō suru''}} "to study", etc.) and has been productive in creating modern slang words. Japanese also has a huge number of compound verbs to express concepts that are described in English using a verb and a preposition (e.g. {{transl|ja|''tobidasu''}} "to fly out, to flee," from {{transl|ja|''tobu''}} "to fly, to jump" + {{transl|ja|''dasu''}} "to put out, to emit").
There are three types of [[Japanese adjectives|adjective]]:
# {{lang|ja|形容詞}} {{transl|ja|''keiyōshi''}}, or {{transl|ja|''i''}} adjectives, which have a [[Japanese verb conjugations|conjugating]] ending {{transl|ja|''i''}} ({{lang|ja|い}}) (such as {{lang|ja|あつい}} {{transl|ja|''atsui''}} "to be hot") which can become past ({{lang|ja|あつかった}} {{transl|ja|''atsukatta''}} "it was hot"), or negative ({{lang|ja|あつくない}} {{transl|ja|''atsuku nai''}} "it is not hot"). Note that {{transl|ja|''nai''}} is also an {{transl|ja|''i''}} adjective, which can become past ({{lang|ja|あつくなかった}} {{transl|ja|''atsuku nakatta''}} "it was not hot").
#: {{lang|ja|暑い日}} {{transl|ja|''atsui hi''}} "a hot day".
# {{lang|ja|形容動詞}} {{transl|ja|''keiyōdōshi''}}, or {{transl|ja|''na''}} adjectives, which are followed by a form of the [[copula]], usually {{transl|ja|''na''}}. For example, {{transl|ja|''hen''}} "strange":
#: {{lang|ja|変なひと}} {{transl|ja|''hen na hito''}} "a strange person".
# {{lang|ja|連体詞}} {{transl|ja|''rentaishi''}}, also called true adjectives, such as {{transl|ja|''ano''}} "that"
#: {{lang|ja|あの山}} {{transl|ja|''ano yama''}} "that mountain".
Both {{transl|ja|''keiyōshi''}} and {{transl|ja|''keiyōdōshi''}} may [[predicate (grammar)|predicate]] sentences. For example,
: {{lang|ja|ご飯が熱い。}} {{transl|ja|''Gohan-ga atsui.''}} "The rice is hot."
: {{lang|ja|彼は変だ。}} {{transl|ja|''Kare-wa hen da.''}} "He's strange."
Both inflect, though they do not show the full range of conjugation found in true verbs.
The {{transl|ja|''rentaishi''}} in Modern Japanese are few in number, and unlike the other words, are limited to directly modifying nouns. They never predicate sentences. Examples include {{transl|ja|''ookina''}} "big", {{transl|ja|''kono''}} "this", {{transl|ja|''iwayuru''}} "so-called" and {{transl|ja|''taishita''}} "amazing".
Both {{transl|ja|''keiyōdōshi''}} and {{transl|ja|''keiyōshi''}} form [[adverb]]s, by following with {{transl|ja|''ni''}} in the case of {{transl|ja|''keiyōdōshi''}}:
: {{lang|ja|変になる}} {{transl|ja|''hen ni naru''}} "become strange",
and by changing {{transl|ja|''i''}} to {{transl|ja|''ku''}} in the case of {{transl|ja|''keiyōshi''}}:
: {{lang|ja|熱くなる}} {{transl|ja|''atsuku naru''}} "become hot".
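The regular {{transl|ja|''keiyōshi''}} pattern described above (drop the final {{transl|ja|''i''}}, then add an ending) can be sketched in a few lines of Python. This is only an illustration on romanized input: the function name <code>inflect_i_adjective</code> and the dictionary keys are hypothetical, and irregular adjectives such as {{transl|ja|''ii''}} "good" are deliberately not handled.

```python
def inflect_i_adjective(adjective: str) -> dict:
    """Return basic forms of a regular, romanized i-adjective such as 'atsui'.

    Illustrative sketch only: irregular adjectives (e.g. 'ii') are not handled.
    """
    if not adjective.endswith("i"):
        raise ValueError("expected a romanized i-adjective ending in 'i'")
    stem = adjective[:-1]  # 'atsui' -> 'atsu'
    return {
        "plain": adjective,                    # atsui       "hot"
        "past": stem + "katta",                # atsukatta   "was hot"
        "negative": stem + "ku nai",           # atsuku nai  "is not hot"
        "negative_past": stem + "ku nakatta",  # atsuku nakatta "was not hot"
        "adverbial": stem + "ku",              # atsuku (as in atsuku naru)
    }


forms = inflect_i_adjective("atsui")
for name, form in forms.items():
    print(f"{name}: {form}")
```

Because {{transl|ja|''nai''}} is itself an {{transl|ja|''i''}} adjective, the negative past in the sketch is simply the same rule applied to {{transl|ja|''nai''}}.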
The grammatical function of nouns is indicated by [[postposition]]s, also called [[Japanese particles|particles]]. These include for example:
* '''{{lang|ja|が}} {{transl|ja|''ga''}}''' for the [[nominative case]]. Not necessarily a subject.
: {{lang|ja|彼'''が'''やった。}} {{transl|ja|''Kare '''ga''' yatta.''}} "'''He''' did it."
* '''{{lang|ja|に}} {{transl|ja|''ni''}}''' for the [[dative case]].
: {{lang|ja|田中さん'''に'''あげて下さい。}} {{transl|ja|''Tanaka-san '''ni''' agete kudasai''}} "Please give it to '''Mr. Tanaka'''."
It is also used for the [[lative]] case, indicating a motion to a location.
: {{lang|ja|日本'''に'''行きたい。}} {{transl|ja|''Nihon '''ni''' ikitai''}} "I want to go '''to Japan'''."
* '''{{lang|ja|の}} {{transl|ja|''no''}}''' for the [[genitive case]], or nominalizing phrases.
: {{lang|ja|私'''の'''カメラ。}} {{transl|ja|''watashi '''no''' kamera''}} "'''my''' camera"
: {{lang|ja|スキーに行く'''の'''が好きです。}} {{transl|ja|''Sukī-ni iku '''no''' ga suki desu''}} "(I) like go'''ing''' skiing."
* '''{{lang|ja|を}} {{transl|ja|''o''}}''' for the [[accusative case]]. Not necessarily an object.
: {{lang|ja|何'''を'''食べますか。}} {{transl|ja|''Nani '''o''' tabemasu ka?''}} "'''What''' will (you) eat?"
* '''{{lang|ja|は}} {{transl|ja|''wa''}}''' for the topic. It can co-exist with case markers above except {{transl|ja|''no''}}, and it overrides {{transl|ja|''ga''}} and {{transl|ja|''o''}}.
: {{lang|ja|私'''は'''タイ料理がいいです。}} {{transl|ja|''Watashi '''wa''' tai-ryōri ga ii desu.''}} "As for me, Thai food is good." The nominative marker {{transl|ja|''ga''}} after {{transl|ja|''watashi''}} is hidden under {{transl|ja|''wa''}}. (Note that English generally makes no distinction between sentence topic and subject.)
Note: The difference between {{transl|ja|'''''wa'''''}} and {{transl|ja|'''''ga'''''}} goes beyond the English distinction between sentence topic and subject. While {{transl|ja|''wa''}} indicates the topic, which the rest of the sentence describes or acts upon, it carries the implication that the subject indicated by {{transl|ja|''wa''}} is not unique, or may be part of a larger group.
: {{transl|ja|''Ikeda-san '''wa''' yonjū-ni sai da.''}} "As for Mr. Ikeda, he is forty-two years old." Others in the group may also be of that age.
Absence of {{transl|ja|''wa''}} often means the subject is the [[focus (linguistics)|focus]] of the sentence.
: {{transl|ja|''Ikeda-san '''ga''' yonjū-ni sai da.''}} "It is Mr. Ikeda who is forty-two years old." This is a reply to an implicit or explicit question such as "Who in this group is forty-two years old?"
=== Politeness ===
Unlike most Western languages, Japanese has an extensive grammatical system to express politeness and formality.
Most relationships are not equal in Japanese [[society]]. The differences in social position are determined by a variety of factors including job, age, experience, or even psychological state (e.g., a person asking a favour tends to do so politely). The person in the lower position is expected to use a polite form of speech, whereas the other might use a more plain form. Strangers will also speak to each other politely. Japanese children rarely use polite speech until they are teens, at which point they are expected to begin speaking in a more adult manner. ''See [[uchi-soto]]''.
Whereas {{transl|ja|''teineigo''}} ({{lang|ja|丁寧語}}) (polite language) is commonly an [[inflection]]al system, {{transl|ja|''sonkeigo''}} ({{lang|ja|尊敬語}}) (respectful language) and {{transl|ja|''kenjōgo''}} ({{lang|ja|謙譲語}}) (humble language) often employ many special honorific and humble alternate verbs: {{transl|ja|''iku''}} "go" becomes {{transl|ja|''ikimasu''}} in polite form, but is replaced by {{transl|ja|''irassharu''}} in honorific speech and {{transl|ja|''ukagau''}} or {{transl|ja|''mairu''}} in humble speech.
The difference between honorific and humble speech is particularly pronounced in the Japanese language. Humble language is used to talk about oneself or one's own group (company, family) whilst honorific language is mostly used when describing the interlocutor and his or her group. For example, the {{transl|ja|''-san''}} suffix ("Mr.", "Mrs." or "Miss") is an example of honorific language. It is not used to talk about oneself or when talking about someone from one's company to an external person, since the company is the speaker's "group". When speaking directly to one's superior in one's company or when speaking with other employees within one's company about a superior, a Japanese person will use vocabulary and inflections of the honorific register to refer to the in-group superior and his or her speech and actions. When speaking to a person from another company (i.e., a member of an out-group), however, a Japanese person will use the plain or the humble register to refer to the speech and actions of his or her own in-group superiors. In short, the register used in Japanese to refer to the person, speech, or actions of any particular individual varies depending on the relationship (either in-group or out-group) between the speaker and listener, as well as depending on the relative status of the speaker, listener, and third-person referents. For this reason, the Japanese system for explicit indication of social register is known as a system of "relative honorifics." This stands in stark contrast to the [[Korean language|Korean]] system of "absolute honorifics," in which the same register is used to refer to a particular individual (e.g. one's father, one's company president, etc.) in any context regardless of the relationship between the speaker and interlocutor. Thus, polite Korean speech can sound very presumptuous when translated verbatim into Japanese, as in Korean it is acceptable and normal to say things like "Our '''Mr.''' Company-President..." when communicating with a member of an out-group, which would be very inappropriate in a Japanese social context.
Most [[noun]]s in the Japanese language may be made polite by the addition of {{transl|ja|''o-''}} or {{transl|ja|''go-''}} as a prefix. {{transl|ja|''o-''}} is generally used for words of native Japanese origin, whereas {{transl|ja|''go-''}} is affixed to words of Chinese derivation. In some cases, the prefix has become a fixed part of the word, and is included even in regular speech, such as {{transl|ja|''gohan''}} 'cooked rice; meal.' Such a construction often indicates deference to either the item's owner or to the object itself. For example, the word {{transl|ja|''tomodachi''}} 'friend,' would become {{transl|ja|''o-tomodachi''}} when referring to the friend of someone of higher status (though mothers often use this form to refer to their children's friends). On the other hand, a polite speaker may sometimes refer to {{transl|ja|''mizu''}} 'water' as {{transl|ja|''o-mizu''}} in order to show politeness.
Most Japanese people employ politeness to indicate a lack of familiarity. That is, they use polite forms for new acquaintances, but if a relationship becomes more intimate, they no longer use them. This occurs regardless of age, social class, or gender.
== Vocabulary ==
The original language of Japan, or at least the original language of a certain population that was ancestral to a significant portion of the historical and present Japanese nation, was the so-called {{transl|ja|''yamato kotoba''}} ({{lang|ja|大和言葉}} or infrequently {{lang|ja|大和詞}}, i.e. "[[Yamato people|Yamato]] words"), which in scholarly contexts is sometimes referred to as {{transl|ja|''wa-go''}} ({{lang|ja|和語}} or rarely {{lang|ja|倭語}}, i.e. the "{{transl|ja|[[Wa (Japan)|Wa]]}} words"). In addition to words from this original language, present-day Japanese includes a great number of words that were either borrowed from [[Chinese language|Chinese]] or constructed from Chinese roots following Chinese patterns. These words, known as {{transl|ja|''[[Sino-Japanese vocabulary|kango]]''}} ({{lang|ja|漢語}}), entered the language from the fifth century onwards via contact with Chinese culture. According to the [[Japanese dictionary]] ''Shinsen-kokugojiten'' (新選国語辞典), [[Sino-Japanese vocabulary|Chinese-based words]] make up 49.1% of the total vocabulary, {{transl|ja|''wago''}} 33.8% and other foreign words 8.8%.
Like Latin-derived words in English, {{transl|ja|''[[Sino-Japanese vocabulary|kango]]''}} words typically are perceived as somewhat formal or academic compared to equivalent Yamato words. Indeed, it is generally fair to say that an English word derived from Latin/French roots typically corresponds to a Sino-Japanese word in Japanese, whereas a simpler Anglo-Saxon word would best be translated by a Yamato equivalent.
A much smaller number of words has been borrowed from [[Korean language|Korean]] and [[Ainu language|Ainu]]. Japan has also borrowed a number of words from other languages, particularly ones of European extraction, which are called {{transl|ja|''[[gairaigo]]''}}. This began with [[Japanese words of Portuguese origin|borrowings from Portuguese]] in the 16th century, followed by borrowing from [[Dutch language|Dutch]] during Japan's [[sakoku|long isolation]] of the [[Edo period]]. With the [[Meiji Restoration]] and the reopening of Japan in the 19th century, borrowing occurred from [[German language|German]], [[French language|French]] and [[English language|English]]. Currently, words of English origin are the most commonly borrowed.
In the Meiji era, the Japanese also coined many neologisms using Chinese roots and morphology to translate Western concepts. Many of these pseudo-Chinese words were imported into [[Chinese language|Chinese]], [[Korean language|Korean]], and [[Vietnamese language|Vietnamese]] via their [[kanji]] in the late 19th and early 20th centuries. For example, {{lang|ja|政治}} {{transl|ja|''seiji''}} ("politics"), and {{lang|ja|化学}} {{transl|ja|''kagaku''}} ("chemistry") are words derived from Chinese roots that were first created and used by the Japanese, and only later borrowed into Chinese and other East Asian languages. As a result, Japanese, Chinese, Korean, and Vietnamese share a large common corpus of vocabulary in the same way a large number of Greek- and Latin-derived words are shared among modern European languages, although many academic words formed from such roots were certainly coined by native speakers of other languages, such as English.
In the past few decades, {{transl|ja|''[[wasei-eigo]]''}} (made-in-Japan English) has become a prominent phenomenon. Words such as {{transl|ja|''wanpatān''}} {{lang|ja|ワンパターン}} (< ''one'' + ''pattern'', "to be in a rut", "to have a one-track mind") and {{transl|ja|''sukinshippu''}} {{lang|ja|スキンシップ}} (< ''skin'' + ''-ship'', "physical contact"), although coined by compounding English roots, are nonsensical in most non-Japanese contexts; exceptions exist in nearby languages such as Korean however, which often use words such as skinship and rimokon (remote control) in the same way as in Japanese.
Additionally, many native Japanese words have become commonplace in English, due to the popularity of many Japanese cultural exports. Words such as [[futon]], [[haiku]], [[judo]], [[kamikaze]], [[karaoke]], [[karate]], [[ninja]], [[origami]], [[rickshaw]] (from {{lang|ja|人力車}} {{transl|ja|''jinrikisha''}}), [[samurai]], [[sayonara]], [[sumo]], [[sushi]], [[tsunami]], [[tycoon]] and many others have become part of the English language. See [[list of English words of Japanese origin]] for more.
== Writing system ==
Literacy was introduced to Japan in the form of the [[Chinese writing system]], by way of [[Baekje]] before the 5th century. Using this writing system, the Japanese emperor [[Emperor Yūryaku|Yūryaku]] sent a letter to the Chinese emperor [[Emperor Shun of Liu Song|Shun of Liu Song]] in 478 CE. After the fall of Baekje, Japan invited scholars from China to learn more of the Chinese writing system. Japanese emperors gave official rank to Chinese scholars (続守言/薩弘格/袁晋卿) and spread the use of Chinese characters from the 7th century to the 8th century.
At first, the Japanese wrote in [[Classical Chinese]], with Japanese names represented by characters used for their meanings and not their sounds. Later, during the seventh century CE, the Chinese-sounding phoneme principle was used to write pure Japanese poetry and prose (comparable to Akkadian's retention of Sumerian cuneiform), but some Japanese words were still written with characters for their meaning and not the original Chinese sound. This is when the history of Japanese as a written language begins in its own right. By this time, the Japanese language was already distinct from the [[Ryukyuan languages]].
The Korean settlers and their descendants used Kudara-on or Baekje pronunciation (百済音), which was also called Tsushima-pronunciation (対馬音) or [[Go-on]] (呉音).
An example of this mixed style is the [[Kojiki]], which was written in 712 AD. They then started to use Chinese characters to write Japanese in a style known as {{transl|ja|''man'yōgana''}}, a syllabic script which used Chinese characters for their sounds in order to transcribe the words of Japanese speech syllable by syllable.
Over time, a writing system evolved. [[Chinese characters]] ([[kanji]]) were used to write either words borrowed from Chinese, or Japanese words with the same or similar meanings. Chinese characters were also used to write grammatical elements, were simplified, and eventually became two syllabic scripts: [[hiragana]] and [[katakana]].
Modern Japanese is written in a mixture of three main systems: [[kanji]], characters of Chinese origin used to represent both Chinese [[loanword]]s into Japanese and a number of native Japanese [[morpheme]]s; and two [[syllabary|syllabaries]]: [[hiragana]] and [[katakana]]. The [[Latin alphabet]] is also sometimes used. Arabic numerals are much more common than the kanji when used in counting, but kanji numerals are still used in compounds, such as {{lang|ja|統一}} {{transl|ja|''tōitsu''}} ("unification").
''[[Hiragana]]'' are used for words without kanji representation, for words no longer written in kanji, and also following kanji to show conjugational endings. Because of the way verbs (and adjectives) in Japanese are [[conjugated]], kanji alone cannot fully convey Japanese tense and mood, as kanji cannot be subject to variation when written without losing its meaning. For this reason, hiragana are suffixed to the ends of kanji to show verb and adjective conjugations. Hiragana used in this way are called [[okurigana]]. Hiragana are also written in a superscript called [[furigana]] above or beside a kanji to show the proper reading. This is done to facilitate learning, as well as to clarify particularly old or obscure (or sometimes invented) readings.
''[[Katakana]]'', like hiragana, are a syllabary; katakana are primarily used to write foreign words, plant and animal names, and for emphasis. For example "Australia" has been adapted as {{transl|ja|''Ōsutoraria''}} ({{lang|ja|オーストラリア}}), and "supermarket" has been adapted and shortened into {{transl|ja|''sūpā''}} ({{lang|ja|スーパー}}). The [[Latin alphabet]] (in Japanese referred to as [[romaji|''Rōmaji'']] ({{lang|ja|ローマ字}}), literally "Roman letters") is used for some loan words like "CD" and "DVD", and also for some Japanese creations like "Sony".
Historically, attempts to limit the number of kanji in use commenced in the mid-19th century, but did not become a matter of government intervention until after Japan's defeat in the Second World War. During the period of post-war occupation (and influenced by the views of some U.S. officials), various schemes including the complete abolition of kanji and exclusive use of rōmaji were considered. The {{transl|ja|''[[jōyō kanji]]''}} ("common use kanji", originally called {{transl|ja|''[[tōyō kanji]]''}} [kanji for general use]) scheme arose as a compromise solution.
Japanese students begin to learn kanji from their first year at elementary school. A guideline created by the Japanese Ministry of Education, the list of {{transl|ja|''[[kyōiku kanji]]''}} ("education kanji", a subset of {{transl|ja|''[[jōyō kanji]]''}}), specifies the 1,006 simple characters a child is to learn by the end of sixth grade. Children continue to study another 939 characters in junior high school, covering in total 1,945 {{transl|ja|''[[jōyō kanji]]''}}. The official list of {{transl|ja|''[[jōyō kanji]]''}} was revised several times, but the total number of officially sanctioned characters remained largely unchanged.
As for kanji for personal names, the circumstances are somewhat complicated. {{transl|ja|''[[Jōyō kanji]]''}} and {{transl|ja|''[[jinmeiyō kanji]]''}} (an appendix of additional characters for names) are approved for registering personal names. Names containing unapproved characters are denied registration. However, as with the list of {{transl|ja|''[[jōyō kanji]]''}}, criteria for inclusion were often arbitrary and led to many common and popular characters being disapproved for use. Under popular pressure and following a court decision holding the exclusion of common characters unlawful, the list of {{transl|ja|''[[jinmeiyō kanji]]''}} was substantially extended from 92 in 1951 (the year it was first decreed) to 983 in 2004. Furthermore, families whose names are not on these lists were permitted to continue using the older forms.
Many writers rely on [[newspaper]] circulation to publish their work with officially sanctioned characters. This distribution method is more efficient than traditional [[pen]] and [[paper]] publications.
==Study by non-native speakers==
Many major universities throughout the world provide Japanese language courses, and a number of secondary and even primary schools worldwide offer courses in the language. International interest in the Japanese language dates from the 1800s but has become more prevalent following Japan's economic bubble of the 1980s and the global popularity of [[Japanese pop culture]] (such as [[anime]] and [[video games]]) since the 1990s. About 2.3 million people studied the language worldwide in 2003: 900,000 South [[Koreans]], 389,000 [[People's Republic of China|Chinese]], 381,000 [[Australians]], and 140,000 [[United States|Americans]] study Japanese in lower and higher educational institutions.
In Japan, more than 90,000 foreign students study at [[List of universities in Japan|Japanese universities]] and Japanese [[language school]]s, including 77,000 Chinese and 15,000 South Koreans in 2003. In addition, local governments and some [[non-profit organisation|NPO]] groups provide free Japanese language classes for foreign residents, including [[Japanese Brazilians]] and foreigners married to Japanese nationals. In the United Kingdom, studies are supported by the [[British Association for Japanese Studies]]. In Ireland, Japanese is offered as a language in the [[Leaving Certificate]] in some schools.
The Japanese government provides standardised tests to measure spoken and written comprehension of Japanese for second language learners; the most prominent is the [[Japanese Language Proficiency Test]] (JLPT). The Japanese External Trade Organisation [[JETRO]] organises the ''Business Japanese Proficiency Test'' which tests the learner's ability to understand Japanese in a business setting.
When learning Japanese in a college setting, students are usually first taught how to pronounce [[romaji]]. From that point, they are taught the two main syllabaries, with [[kanji]] usually being introduced in the second semester. The focus is usually first on polite (distal) speech, which students who might interact with native speakers would be expected to use. Casual speech and formal speech usually follow polite speech, as does the usage of honorifics.
Java (programming language)
'''Java''' is a [[programming language]] originally developed by [[Sun Microsystems]] and released in 1995 as a core component of Sun Microsystems' [[Java (Sun)|Java platform]]. The language derives much of its [[Syntax of programming languages|syntax]] from [[C (programming language)|C]] and [[C++]] but has a simpler [[object model]] and fewer low-level facilities. Java applications are typically [[compiler|compiled]] to [[bytecode]] that can run on any [[Java virtual machine]] (JVM) regardless of [[computer architecture]].
The original and [[reference implementation]] Java [[compiler]]s, virtual machines, and [[library (computing)|class libraries]] were developed by Sun from 1995. As of May 2007, in compliance with the specifications of the [[Java Community Process]], Sun made available most of their Java technologies as [[free software]] under the [[GNU General Public License]]. Others have also developed alternative implementations of these Sun technologies, such as the [[GNU Compiler for Java]] and [[GNU Classpath]].
== History ==
The Java language was created by [[James Gosling]] in June 1991 for use in one of his many [[set-top box]] projects. The language was initially called ''Oak'', after an [[oak tree]] that stood outside Gosling's office; it also went by the name ''Green'', and was later renamed ''Java'', from a list of random words. Gosling's goals were to implement a [[virtual machine]] and a language that had a familiar C/C++ style of notation. The first public implementation was Java 1.0 in 1995. It promised "[[Write once, run anywhere|Write Once, Run Anywhere]]" (WORA), providing no-cost runtimes on popular platforms. It was fairly secure and its security was configurable, allowing network and file access to be restricted. Major web browsers soon incorporated the ability to run secure Java ''[[applet]]s'' within web pages. Java quickly became popular. With the advent of ''Java 2'', new versions had multiple configurations built for different types of platforms. For example, ''[[J2EE]]'' was for enterprise applications and the greatly stripped down version ''[[J2ME]]'' was for mobile applications. ''[[J2SE]]'' was the designation for the Standard Edition. In 2006, for marketing purposes, new ''J2'' versions were renamed ''Java EE'', ''Java ME'', and ''Java SE'', respectively.
In 1997, Sun Microsystems approached the [[International Organization for Standardization#JTC1|ISO/IEC JTC1 standards body]] and later the [[Ecma International]] to formalize Java, but it soon withdrew from the process. Java remains a [[de facto]] standard that is controlled through the [[Java Community Process]]. At one time, Sun made most of its Java implementations available without charge although they were [[proprietary software]]. Sun's revenue from Java was generated by the selling of licenses for specialized products such as the Java Enterprise System. Sun distinguishes between its [[Software Development Kit|Software Development Kit (SDK)]] and [[HotSpot|Runtime Environment (JRE)]] that is a subset of the SDK, the primary distinction being that in the JRE, the compiler, utility programs, and many necessary header files are not present.
On [[13 November]] [[2006]], Sun released much of Java as [[free software|free]] and [[open-source software|open-source]] software under the terms of the [[GNU General Public License]] (GPL). On [[8 May]] [[2007]] Sun finished the process, making all of Java's core code free and open-source, aside from a small portion of code to which Sun did not hold the copyright.
== Philosophy ==
=== Primary goals ===
There were five primary goals in the creation of the Java language:
# It should use the [[object-oriented programming]] methodology.
# It should allow the same program to be [[execution (computers)|executed]] on multiple [[operating system]]s.
# It should contain built-in support for using [[computer network]]s.
# It should be designed to execute code from [[remote procedure call|remote source]]s securely.
# It should be easy to use by selecting what were considered the good parts of other object-oriented languages.
=== Platform independence ===
One characteristic, [[Cross-platform|platform independence]], means that [[computer program|program]]s written in the Java language must run similarly on any supported hardware/operating-system platform. One should be able to write a program once, compile it once, and run it anywhere.
This is achieved by most Java [[compiler]]s by compiling the Java language code ''halfway'' (to [[Java bytecode]]) – simplified machine instructions specific to the Java platform. The code is then run on a [[virtual machine]] (VM), a program written in native code on the host hardware that [[Interpreter (computing)|interprets]] and executes generic Java bytecode. (In some JVM versions, bytecode can also be compiled to native code, either before or during program execution, resulting in faster execution.) Further, standardized libraries are provided to allow access to features of the host machines (such as graphics, [[thread (computer science)|threading]] and [[Computer network|networking]]) in unified ways. Note that, although there is an explicit compiling stage, at some point, the Java bytecode is interpreted or converted to native [[machine code]] by the [[Just-in-time compilation|JIT compiler]].
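As an illustrative sketch of the point about standardized libraries, the following program queries the host through standard system properties. The class name PlatformInfo and the method describe are hypothetical names chosen for this example; the property keys (os.name, os.arch, java.vm.name) are standard ones. The same compiled class file would be expected to print different values on different hosts without being recompiled.

```java
// PlatformInfo.java
// Hypothetical illustration; the class and method names are not part of the Java platform.
import java.io.File;

public class PlatformInfo {

    // Builds a one-line description of the host using only standard library calls.
    public static String describe() {
        return "OS: " + System.getProperty("os.name")
                + ", architecture: " + System.getProperty("os.arch")
                + ", VM: " + System.getProperty("java.vm.name")
                + ", file separator: " + File.separator;
    }

    public static void main(String[] args) {
        // The output depends on the machine the bytecode happens to run on.
        System.out.println(describe());
    }
}
```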
The first implementations of the language used an interpreted virtual machine to achieve [[Porting|portability]]. These implementations produced programs that ran slower than programs compiled to native executables, for instance written in C or C++, so the language suffered a reputation for poor performance. More recent JVM implementations produce programs that run significantly faster than before, using multiple techniques.
One technique, known as ''just-in-time compilation'' (JIT), translates the Java bytecode into native code at the time that the program is run, which results in a program that executes faster than interpreted code but also incurs compilation overhead during execution. More sophisticated VMs use ''[[dynamic recompilation]]'', in which the VM can analyze the behavior of the running program and selectively recompile and optimize critical parts of the program. Dynamic recompilation can achieve optimizations superior to static compilation because the dynamic compiler can base optimizations on knowledge about the runtime environment and the set of loaded classes, and can identify the ''hot spots'' (parts of the program, often inner loops, that take up the most execution time). JIT compilation and dynamic recompilation allow Java programs to take advantage of the speed of native code without losing portability.
Another technique, commonly known as ''static compilation'', is to compile directly into native code like a more traditional compiler. Static Java compilers, such as [[GCJ]], translate the Java language code to native [[object code]], removing the intermediate bytecode stage. This achieves good performance compared to interpretation, but at the expense of portability; the output of these compilers can only be run on a single [[Computer architecture|architecture]]. Some see avoiding the VM in this manner as defeating the point of developing in Java; however it can be useful to provide both a generic [[bytecode]] version, as well as an optimised native code version of an application.
=== Implementations ===
Sun Microsystems officially licenses the Java Standard Edition platform for [[Microsoft Windows]], [[Linux]], and [[Solaris (operating system)|Solaris]]. Through a network of third-party vendors and licensees, alternative Java environments are available for these and other platforms. To qualify as a certified Java licensee, an implementation on any particular platform must pass a rigorous suite of validation and compatibility tests. This process provides a guaranteed level of compliance across platforms through a trusted set of commercial and non-commercial partners.
Sun's trademark license for usage of the Java brand insists that all implementations be "compatible". This resulted in a legal dispute with [[Microsoft]] after Sun claimed that the Microsoft implementation did not support the [[Java remote method invocation|RMI]] and [[Java Native Interface|JNI]] interfaces and had added platform-specific features of their own. Sun sued in 1997, and in 2001 won a settlement of $20 million as well as a court order enforcing the terms of the license from Sun. As a result, Microsoft no longer ships Java with [[Microsoft Windows|Windows]], and in recent versions of Windows, [[Internet Explorer]] cannot support Java applets without a third-party plugin. However, Sun and others have made available Java run-time systems at no cost for those and other versions of Windows.
Platform-independent Java is essential to the [[Java Enterprise Edition]] strategy, and an even more rigorous validation is required to certify an implementation. This environment enables portable server-side applications, such as [[Web service]]s, [[servlet]]s, and [[Enterprise JavaBean]]s, as well as [[Embedded system]]s based on [[OSGi]], using [[Embedded Java]] environments. Through the new [[GlassFish]] project, Sun is working to create a fully functional, unified [[open-source]] implementation of the Java EE technologies.
=== Automatic memory management ===
One of the ideas behind Java's automatic memory management model is that programmers be spared the burden of having to perform manual memory management. In some languages the programmer allocates memory for the creation of objects stored on the [[heap]] and the responsibility of later deallocating that memory also resides with the programmer. If the programmer forgets to deallocate memory or writes code that fails to do so, a [[memory leak]] occurs and the program can consume an arbitrarily large amount of memory. Additionally, if the program attempts to deallocate the region of memory more than once, the result is undefined and the program may become unstable and may crash. Finally, in non garbage collected environments, there is a certain degree of overhead and complexity of user-code to track and finalize allocations. Often developers may box themselves into certain designs to provide reasonable assurances that memory leaks will not occur.
In Java, this potential problem is avoided by [[automatic garbage collection]]. The programmer determines when objects are created, and the Java runtime is responsible for managing the [[object lifetime|object's lifecycle]]. The program or other objects can reference an object by holding a reference to it (which, from a low-level point of view, is its address on the heap). When no references to an object remain, the [[unreachable object]] is eligible for release by the Java garbage collector - it may be freed automatically by the garbage collector at any time. Memory leaks may still occur if a programmer's code holds a reference to an object that is no longer needed—in other words, they can still occur but at higher conceptual levels.
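The kind of leak described above, in which an object stays reachable only because some long-lived structure still refers to it, can be sketched as follows. The class ReportCache and its methods are hypothetical and exist only for this illustration. The example does not try to observe the collector itself, whose timing is not predictable; it only shows when objects remain reachable.

```java
// ReportCache.java
// Hypothetical example class; not part of any Java library.
import java.util.ArrayList;
import java.util.List;

public class ReportCache {
    // A long-lived collection: every array added here stays reachable.
    private final List<byte[]> retained = new ArrayList<byte[]>();

    public void remember(byte[] report) {
        retained.add(report);
    }

    public void forgetAll() {
        retained.clear();
    }

    public int retainedCount() {
        return retained.size();
    }

    public static void main(String[] args) {
        ReportCache cache = new ReportCache();
        for (int i = 0; i < 3; i++) {
            // The list keeps a reference, so these arrays cannot be collected
            // even though the loop variable goes out of scope.
            cache.remember(new byte[1024]);
        }
        System.out.println(cache.retainedCount()); // prints 3

        // After clearing, nothing refers to the arrays: they become eligible for collection.
        cache.forgetAll();
        System.out.println(cache.retainedCount()); // prints 0
    }
}
```

If forgetAll were never called, the arrays would be retained for as long as the cache itself is reachable, which is the higher-level form of leak that garbage collection does not prevent.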
The use of garbage collection in a language can also affect programming paradigms. If, for example, the developer assumes that the cost of memory allocation/recollection is low, they may choose to more freely construct objects instead of pre-initializing, holding and reusing them. With the small cost of potential performance penalties (inner-loop construction of large/complex objects), this facilitates thread-isolation (no need to synchronize as different threads work on different object instances) and data-hiding. The use of transient immutable value-objects minimizes side-effect programming.
Comparing Java and [[C++]], it is possible in C++ to implement similar functionality (for example, a memory management model for specific classes can be designed in C++ to improve speed and lower memory fragmentation considerably), with the possible cost of adding comparable runtime overhead to that of Java's garbage collector, and of added development time and application complexity if one favors manual implementation over using an existing third-party library. In Java, garbage collection is built-in and virtually invisible to the developer. That is, developers may have no notion of when garbage collection will take place as it may not necessarily correlate with any actions being explicitly performed by the code they write. Depending on intended application, this can be beneficial or disadvantageous: the programmer is freed from performing low-level tasks, but at the same time loses the option of writing lower level code. Additionally, the garbage collection capability demands some attention to tuning the JVM, as large heaps will cause apparently random stalls in performance.
Java does not support [[pointer (computing)|pointer arithmetic]] as is supported in, for example, C++. This is because the garbage collector may relocate referenced objects, invalidating such pointers. Another reason that Java forbids this is that type safety and security can no longer be guaranteed if arbitrary manipulation of pointers is allowed.
== Syntax ==
The syntax of Java is largely derived from [[C++]]. Unlike C++, which combines the syntax for structured, generic, and object-oriented programming, Java was built exclusively as an object-oriented language. As a result, almost everything is an object and all code is written inside a class. The exceptions are the intrinsic data types (integer and floating-point numbers, boolean values, and characters), which are not classes for performance reasons.
=== Hello, world program ===
This is a minimal [[Hello world program]] in Java with [[syntax highlighting]]:
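A minimal sketch consistent with the explanation that follows (a public class named Hello whose main method prints through System.out.println); the exact greeting text is an assumption:

```java
// Hello.java
public class Hello {
    public static void main(String[] args) {
        // Print one line of text to standard output.
        System.out.println("Hello, world!");
    }
}
```

With a standard JDK installed, such a file is typically compiled with the command javac Hello.java and launched with java Hello.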
To execute a Java program, the code is saved as a file named Hello.java. It must first be compiled into bytecode using a [[Java compiler]], which produces a file named Hello.class. This class is then ''launched''.
The above example merits a bit of explanation.
* All executable statements in Java are written inside a class, including stand-alone programs.
* Source files are by convention named the same as the class they contain, appending the mandatory suffix ''.java''. A '''class''' that is declared '''public''' is required to follow this convention. (In this case, the class '''Hello''' is public, therefore the source must be stored in a file called ''Hello.java'').
* The compiler will generate a class file for each class defined in the source file. The name of the class file is the name of the class, with ''.class'' appended. For class file generation, anonymous classes are treated as if their name was the concatenation of the name of their enclosing class, a ''$'', and an integer.
* The [[Java keywords|keyword]] '''public''' denotes that a method can be called from code in other classes, or that a class may be used by classes outside the class hierarchy.
* The keyword '''static''' indicates that the method is a [[class method|static method]], associated with the class rather than object instances.
* The keyword '''void''' indicates that the main method does not return any value to the caller.
* The method name "main" is not a keyword in the Java language. It is simply the name of the method the Java launcher calls to pass control to the program. Java classes that run in managed environments such as applets and [[Enterprise Java Beans]] do not use or need a main() method.
* The main method must accept an [[array]] of '''{{Javadoc:SE|java/lang|String}}''' objects. By convention, it is referenced as '''args''' although any other legal identifier name can be used. Since Java 5, the main method can also use [[varargs|variable arguments]], in the form of public static void main(String... args), allowing the main method to be invoked with an arbitrary number of String arguments. The effect of this alternate declaration is semantically identical (the args parameter is still an array of String objects), but allows an alternate syntax for creating and passing the array.
* The Java launcher launches Java by loading a given class (specified on the command line) and starting its public static void main(String[]) method. Stand-alone programs must declare this method explicitly. The String[] args parameter is an [[array]] of {{Javadoc:SE|java/lang|String}} objects containing any arguments passed to the class. The parameters to main are often passed by means of a [[command line]].
* The printing facility is part of the Java standard library: The '''{{Javadoc:SE|java/lang|System}}''' class defines a public static field called '''{{Javadoc:SE|name=out|java/lang|System|out}}'''. The out object is an instance of the {{Javadoc:SE|java/io|PrintStream}} class and provides the method '''{{Javadoc:SE|name=println(String)|java/io|PrintStream|println(java.lang.String)}}''' for displaying data to the screen while creating a new line ([[standard streams|standard out]]).
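The variable-arguments form of main mentioned in the notes above can be sketched as follows. The class name Args and the helper method describe are hypothetical; the sketch only shows that a parameter declared as String... is still an array of String inside the method, and that callers may pass either separate values or an existing array.

```java
// Args.java
// Hypothetical example; the class and helper names are chosen for illustration only.
public class Args {

    // Declared with varargs: callers may pass zero or more Strings, or a String[].
    public static String describe(String... args) {
        // Inside the method, args is an ordinary String array.
        return args.length + " argument(s): " + String.join(", ", args);
    }

    // The varargs form of main, accepted as a program entry point since Java 5.
    public static void main(String... args) {
        System.out.println(describe(args));
    }
}
```

Launched as java Args a b, this sketch would receive the two command-line arguments as the elements of args.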
=== A more comprehensive example ===
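The following is a sketch of a program matching the notes below, not a verbatim listing. The prompt text, the arrangement in which showDialog reads the number and then calls calculate, and the second constructor OddEven(int) (added so that the class can be exercised without a dialog) are assumptions made for this illustration.

```java
// OddEven.java
import javax.swing.JOptionPane;

public class OddEven {
    // The number supplied by the user; every instance has its own copy.
    private int input;

    // Public constructor: same name as the class and no return type.
    public OddEven() {
        // Nothing to set up here; input is filled in by showDialog().
    }

    // Hypothetical convenience constructor for non-interactive use.
    public OddEven(int input) {
        this.input = input;
    }

    public static void main(String[] args) {
        // Create an instance and assign it to the local reference variable "number".
        OddEven number = new OddEven();
        number.showDialog();
    }

    // Asks for a number in a dialog box, stores it, and reports whether it is even or odd.
    // Error handling is deliberately omitted: non-numeric input throws NumberFormatException.
    public void showDialog() {
        input = Integer.parseInt(JOptionPane.showInputDialog("Please enter a number"));
        calculate();
    }

    // Instance method: "input" here means the same as "this.input".
    public void calculate() {
        if (input % 2 == 0) {
            System.out.println("Even");
        } else {
            System.out.println("Odd");
        }
    }
}
```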
* The '''[[Java keywords#import|import]]''' statement imports the '''{{Javadoc:SE|javax/swing|JOptionPane}}''' class from the '''{{Javadoc:SE|package=javax.swing|javax/swing}}''' package.
* The '''OddEven''' class declares a single '''[[Java keywords#private|private]]''' [[field (computer science)|field]] of type '''int''' named '''input'''. Every instance of the OddEven class has its own copy of the input field. The private declaration means that no other class can access (read or write) the input field.
* '''OddEven()''' is a '''public''' [[constructor (computer science)|constructor]]. Constructors have the same name as the enclosing class they are declared in, and unlike a method, have no [[return type]]. A constructor is used to initialize an [[object (computer science)|object]] that is a newly created instance of the class. The dialog returns a String that is converted to an int by the '''{{Javadoc:SE|java/lang|Integer|parseInt(String)}}''' method.
* The '''calculate()''' method is declared without the static keyword. This means that the method is invoked using a specific instance of the OddEven class. (The [[reference (computer science)|reference]] used to invoke the method is passed as an undeclared parameter of type OddEven named '''[[Java keywords#this|this]]'''.) The method tests the expression input % 2 == 0 using the '''[[Java keywords#if|if]]''' keyword to see if the remainder of dividing the input field belonging to the instance of the class by two is zero. If this expression is true, then it prints '''Even'''; if this expression is false it prints '''Odd'''. (The input field can be equivalently accessed as this.input, which explicitly uses the undeclared this parameter.)
* '''OddEven number = new OddEven();''' declares a local object [[reference (computer science)|reference]] variable in the main method named number. This variable can hold a reference to an object of type OddEven. The declaration initializes number by first creating an instance of the OddEven class, using the '''[[Java keywords#new|new]]''' keyword and the OddEven() constructor, and then assigning this instance to the variable.
* The statement '''number.showDialog();''' calls the calculate method. The instance of OddEven object referenced by the number [[local variable]] is used to invoke the method and passed as the undeclared this parameter to the calculate method.
* For simplicity, [[error handling]] has been ignored in this example. Entering a value that is not a number will cause the program to crash. This can be avoided by catching and handling the {{Javadoc:SE|java/lang|NumberFormatException}} thrown by Integer.parseInt(String).
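The error handling mentioned in the last note can be sketched as follows. The class SafeParse and the method parseOrDefault are hypothetical names; only Integer.parseInt and NumberFormatException come from the standard library. Returning a fallback value is one possible policy among several (another would be to ask the user again).

```java
// SafeParse.java
// Hypothetical helper class used only to illustrate catching NumberFormatException.
public class SafeParse {

    // Returns the parsed integer, or the fallback when the text is not a valid int.
    public static int parseOrDefault(String text, int fallback) {
        try {
            return Integer.parseInt(text);
        } catch (NumberFormatException e) {
            // Reached for input such as "abc", "", or null
            // (a cancelled input dialog returns null).
            return fallback;
        }
    }

    public static void main(String[] args) {
        System.out.println(parseOrDefault("42", -1));    // prints 42
        System.out.println(parseOrDefault("forty", -1)); // prints -1
    }
}
```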
=== Applet ===
Java applets are programs that are embedded in other applications, typically in a Web page displayed in a [[Web browser]].
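A minimal sketch of such an applet, reconstructed from the explanation that follows (a class Hello that extends Applet and draws a string at offset 65, 95); the parameter name gc is an assumption. Note that the java.applet API has been deprecated in later Java releases, so a current compiler may emit deprecation warnings for this code.

```java
// Hello.java
import java.applet.Applet;
import java.awt.Graphics;

public class Hello extends Applet {
    // Called by AWT whenever the applet needs to draw itself.
    @Override
    public void paint(Graphics gc) {
        // Draw the text at a pixel offset of (65, 95) from the upper-left corner.
        gc.drawString("Hello, world!", 65, 95);
    }
}
```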
The '''import''' statements direct the [[Java compiler]] to include the '''{{Javadoc:SE|package=java.applet|java/applet|Applet}}''' and '''{{Javadoc:SE|package=java.awt|java/awt|Graphics}}''' classes in the compilation. The import statement allows these classes to be referenced in the [[source code]] using the ''simple class name'' (i.e. Applet) instead of the ''fully qualified class name'' (i.e. java.applet.Applet).
The Hello class '''extends''' ([[subclass (computer science)|subclasses]]) the '''Applet''' class; the Applet class provides the framework for the host application to display and control the [[Object lifetime|lifecycle]] of the applet. The Applet class is an [[Abstract Windowing Toolkit]] (AWT) {{Javadoc:SE|java/awt|Component}}, which provides the applet with the capability to display a [[graphical user interface]] (GUI) and respond to user [[event-driven programming|events]].
The Hello class [[method overriding (programming)|overrides]] the '''{{Javadoc:SE|name=paint(Graphics)|java/awt|Container|paint(java.awt.Graphics)}}''' method inherited from the {{Javadoc:SE|java/awt|Container}} [[superclass (computer science)|superclass]] to provide the code to display the applet. The paint() method is passed a '''Graphics''' object that contains the graphic context used to display the applet. The paint() method calls the graphic context '''{{Javadoc:SE|name=drawString(String, int, int)|java/awt|Graphics|drawString(java.lang.String,%20int,%20int)}}''' method to display the '''"Hello, world!"''' string at a [[pixel]] offset of ('''65, 95''') from the upper-left corner in the applet's display.
An applet is placed in an [[HTML]] document using the '''&lt;applet&gt;''' [[HTML element]]. The applet tag has three attributes set: '''code="Hello"''' specifies the name of the Applet class and '''width="200" height="200"''' sets the pixel width and height of the applet. Applets may also be embedded in HTML using either the object or embed element, although support for these elements by Web browsers is inconsistent. However, the applet tag is deprecated, so the object tag is preferred where supported.
The host application, typically a Web browser, instantiates the '''Hello''' applet and creates an {{Javadoc:SE|java/applet|AppletContext}} for the applet. Once the applet has initialized itself, it is added to the AWT display hierarchy. The paint method is called by the AWT [[event dispatching thread]] whenever the display needs the applet to draw itself.
=== Servlet ===
Java Servlet technology provides Web developers with a simple, consistent mechanism for extending the functionality of a Web server and for accessing existing business systems. Servlets are [[server-side]] Java EE components that generate responses (typically [[HTML]] pages) to requests (typically [[HTTP]] requests) from [[client (computing)|client]]s. A servlet can almost be thought of as an applet that runs on the server side—without a face.
The '''import''' statements direct the Java compiler to include all of the public classes and [[interface (Java)|interfaces]] from the '''{{Javadoc:SE|package=java.io|java/io}}''' and '''{{Javadoc:EE|package=javax.servlet|javax/servlet}}''' [[Java package|packages]] in the compilation.
The '''Hello''' class '''extends''' the '''{{Javadoc:EE|javax/servlet|GenericServlet}}''' class; the GenericServlet class provides the interface for the [[server (computing)|server]] to forward requests to the servlet and control the servlet's lifecycle.
The Hello class overrides the '''{{Javadoc:EE|name=service(ServletRequest, ServletResponse)|javax/servlet|Servlet|service(javax.servlet.ServletRequest,javax.servlet.ServletResponse)}}''' method defined by the {{Javadoc:EE|javax/servlet|Servlet}} [[Interface (Java)|interface]] to provide the code for the service request handler. The service() method is passed a '''{{Javadoc:EE|javax/servlet|ServletRequest}}''' object that contains the request from the client and a '''{{Javadoc:EE|javax/servlet|ServletResponse}}''' object used to create the response returned to the client. The service() method declares that it '''throws''' the [[exception handling|exceptions]] {{Javadoc:EE|javax/servlet|ServletException}} and {{Javadoc:SE|java/io|IOException}} if a problem prevents it from responding to the request.
The '''{{Javadoc:EE|name=setContentType(String)|javax/servlet|ServletResponse|setContentType(java.lang.String)}}''' method in the response object is called to set the [[MIME]] content type of the returned data to '''"text/html"'''. The '''{{Javadoc:EE|name=getWriter()|javax/servlet|ServletResponse|getWriter()}}''' method in the response returns a '''{{Javadoc:SE|java/io|PrintWriter}}''' object that is used to write the data that is sent to the client. The '''{{Javadoc:SE|name=println(String)|java/io|PrintWriter|println(java.lang.String)}}''' method is called to write the '''"Hello, world!"''' string to the response and then the '''{{Javadoc:SE|name=close()|java/io|PrintWriter|close()}}''' method is called to close the print writer, which causes the data that has been written to the stream to be returned to the client.
=== JavaServer Page ===
JavaServer Pages (JSPs) are [[server-side]] Java EE components that generate responses, typically [[HTML]] pages, to [[HTTP]] requests from [[client (computing)|client]]s. JSPs embed Java code in an HTML page by using the special [[delimiter]]s <% and %>. A JSP is compiled to a Java ''servlet'', a Java application in its own right, the first time it is accessed. After that, the generated servlet creates the response.
=== Swing application ===
Swing is a graphical user interface [[library (computer science)|library]] for the Java SE platform. This example Swing application creates a single window with "Hello, world!" inside:
The first '''import''' statement directs the Java compiler to include the {{Javadoc:SE|java/awt|BorderLayout}} class from the {{Javadoc:SE|package=java.awt|java/awt}} package in the compilation; the second '''import''' includes all of the public classes and interfaces from the '''{{Javadoc:SE|package=javax.swing|javax/swing}}''' package.
The '''Hello''' class '''extends''' the '''{{Javadoc:SE|javax/swing|JFrame}}''' class; the JFrame class implements a [[window (computing)|window]] with a [[title bar]] and a close [[Widget (computing)|control]].
The '''Hello()''' [[constructor (computer science)|constructor]] initializes the frame by first calling the superclass constructor, passing the parameter "hello", which is used as the window's title. It then calls the '''{{Javadoc:SE|name=setDefaultCloseOperation(int)|javax/swing|JFrame|setDefaultCloseOperation(int)}}''' method inherited from JFrame to set the default operation when the close control on the title bar is selected to '''{{Javadoc:SE|javax/swing|WindowConstants|EXIT_ON_CLOSE}}''' — this causes the JFrame to be disposed of when the frame is closed (as opposed to merely hidden), which allows the JVM to exit and the program to terminate. Next, the [[Layout manager|layout]] of the frame is set to a BorderLayout; this tells Swing how to arrange the components that will be added to the frame. A '''{{Javadoc:SE|javax/swing|JLabel}}''' is created for the string '''"Hello, world!"''' and the '''{{Javadoc:SE|name=add(Component)|java/awt|Container|add(java.awt.Component)}}''' method inherited from the {{Javadoc:SE|java/awt|Container}} superclass is called to add the label to the frame. The '''{{Javadoc:SE|name=pack()|java/awt|Window|pack()}}''' method inherited from the {{Javadoc:SE|java/awt|Window}} superclass is called to size the window and lay out its contents, in the manner indicated by the BorderLayout.
The '''main()''' method is called by the JVM when the program starts. It [[Instance (programming)|instantiates]] a new '''Hello''' frame and causes it to be displayed by calling the '''{{Javadoc:SE|name=setVisible(boolean)|java/awt|Component|setVisible(boolean)}}''' method inherited from the {{Javadoc:SE|java/awt|Component}} superclass with the boolean parameter '''true'''. Note that once the frame is displayed, exiting the main method does not cause the program to terminate because the AWT [[event dispatching thread]] remains active until all of the Swing top-level windows have been disposed.
== Criticism ==
[[Java performance|Java's performance]] has improved substantially since the early versions, and performance of [[JIT compiler]]s relative to native compilers has in some tests been shown to be quite similar. The performance of the compilers does not necessarily indicate the performance of the compiled code; only careful testing can reveal the true performance issues in any system.
The default [[look and feel]] of [[Graphical User Interface|GUI]] applications written in Java using the [[Swing (Java)|Swing]] toolkit is very different from native applications. It is possible to specify a different look and feel through the [[pluggable look and feel]] system of Swing. Clones of [[Microsoft Windows|Windows]], [[GTK]] and [[Motif (widget toolkit)|Motif]] are supplied by Sun. [[Apple Computer|Apple]] also provides an [[Aqua (theme)|Aqua]] look and feel for [[Mac OS X]]. Though prior implementations of these looks and feels have been considered lacking, Swing in Java SE 6 addresses this problem by using more native [[Widget (computing)|widget]] drawing routines of the underlying platforms. Alternatively, third party toolkits such as [[wx4j]], [[Qt (toolkit)|Qt Jambi]] or [[Standard Widget Toolkit|SWT]] may be used for increased integration with the native windowing system.
As in C++ and some other object-oriented languages, variables of Java's [[primitive type]]s were not originally objects. Values of primitive types are either stored directly in fields (for objects) or on the [[Stack-based memory allocation|stack]] (for methods) rather than on the heap, as is the common case for objects (but see [[Escape analysis]]). This was a conscious decision by Java's designers for performance reasons. Because of this, Java was not considered to be a pure object-oriented programming language. However, as of Java 5.0, [[Object type|autoboxing]] lets programmers write code as if values of primitive types were instances of their wrapper classes; the compiler converts between each primitive type and its wrapper class automatically, so the two can be freely interchanged.
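A short sketch of autoboxing and unboxing, assuming Java 5.0 or later (the diamond operator used here needs Java 7). The class name AutoboxingDemo and its sum method are hypothetical names chosen for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of any library.
class AutoboxingDemo {
    // Sums a list of Integer objects; the compiler unboxes each element to int.
    static int sum(List<Integer> values) {
        int total = 0;
        for (Integer v : values) {
            total += v;          // unboxing: Integer -> int
        }
        return total;
    }

    public static void main(String[] args) {
        List<Integer> values = new ArrayList<>();
        values.add(1);           // autoboxing: int -> Integer
        values.add(2);
        values.add(3);
        Integer boxed = 10;      // autoboxing in an assignment
        int primitive = boxed;   // unboxing in an assignment
        System.out.println(sum(values) + primitive); // prints 16
    }
}
```

Collections such as List can hold only objects, which is why the int values are wrapped in Integer instances when they are added.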
Java suppresses several features (such as [[operator overloading]] and [[multiple inheritance]]) for ''classes'' in order to simplify the language, to "save the programmers from themselves", and to prevent possible errors and anti-pattern design. This has been a source of criticism, relating to a lack of low-level features, but some of these limitations may be worked around. Java ''interfaces'' have always had multiple inheritance.
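The difference between classes and interfaces can be illustrated with a small sketch. All of the type names below (Duck, Swimmer, Flyer, Amphibian) are hypothetical and exist only for this example.

```java
// A class may extend only one class, but it may implement several interfaces.
class Duck implements Amphibian {
    public String swim() { return "paddling"; }
    public String fly()  { return "flapping"; }

    public static void main(String[] args) {
        Duck d = new Duck();
        Swimmer s = d;   // the same object viewed through either interface type
        Flyer f = d;
        System.out.println(s.swim() + ", " + f.fly());
    }
}

interface Swimmer {
    String swim();
}

interface Flyer {
    String fly();
}

// An interface may extend more than one interface (multiple inheritance of type).
interface Amphibian extends Swimmer, Flyer {
}
```

Writing "class Duck extends A, B" for two classes A and B would be rejected by the compiler, which is the restriction the paragraph above describes.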
== Resources ==
=== Java Runtime Environment ===
The Java Runtime Environment, or ''JRE'', is the software required to run any [[Application software|application]] deployed on the Java Platform. [[End-user]]s commonly use a JRE in [[Software package (programming)|software package]]s and Web browser [[plugin]]s. Sun also distributes a superset of the JRE called the Java 2 [[SDK]] (more commonly known as the JDK), which includes development tools such as the [[Java compiler]], [[Javadoc]], [[JAR (file format)|Jar]] and [[debugger]].
One of the unique advantages of the concept of a runtime engine is that errors (exceptions) should not 'crash' the system. Moreover, in runtime engine environments such as Java there exist tools that attach to the runtime engine and every time that an exception of interest occurs they record debugging information that existed in memory at the time the exception was thrown (stack and heap values). These [[Automated Exception Handling]] tools provide 'root-cause' information for exceptions in Java programs that run in production, testing or development environments.
==== Components ====
* Java [[Library (computer science)|libraries]] are the compiled [[byte code]]s of [[source code]] developed by the JRE implementor to support application development in Java. Examples of these libraries are:
** The core libraries, which include:
*** Collection libraries that implement [[data structure]]s such as [[List (computing)|lists]], [[associative array|dictionaries]], [[tree structure|trees]] and [[Set (computer science)|sets]]
*** [[XML]] Processing (Parsing, Transforming, Validating) libraries
*** Security
*** [[i18n|Internationalization and localization]] libraries
** The integration libraries, which allow the application writer to communicate with external systems. These libraries include:
*** The [[Java Database Connectivity]] (JDBC) [[Application Programming Interface|API]] for database access
*** [[Java Naming and Directory Interface]] (JNDI) for lookup and discovery
*** [[Java remote method invocation|RMI]] and [[CORBA]] for distributed application development
** [[User Interface]] libraries, which include:
*** The (heavyweight, or [[native mode|native]]) [[Abstract Windowing Toolkit]] (AWT), which provides [[graphical user interface|GUI]] components, the means for laying out those components and the means for handling events from those components
*** The (lightweight) [[Swing (Java)|Swing]] libraries, which are built on AWT but provide (non-native) implementations of the AWT widgetry
*** APIs for audio capture, processing, and playback
* A platform dependent implementation of [[Java virtual machine]] (JVM) that is the means by which the byte codes of the Java libraries and third party applications are executed
* Plugins, which enable [[Java applet|applet]]s to be run in [[Web browser]]s
* [[Java Web Start]], which allows Java applications to be efficiently distributed to [[end user]]s across the [[Internet]]
* Licensing and documentation
=== APIs ===
Sun has defined three platforms targeting different application environments and segmented many of its [[application programming interface|API]]s so that they belong to one of the platforms. The platforms are:
* [[Java Platform, Micro Edition]] (Java ME) — targeting environments with limited resources,
* [[Java Platform, Standard Edition]] (Java SE) — targeting workstation environments, and
* [[Java Platform, Enterprise Edition]] (Java EE) — targeting large distributed enterprise or Internet environments.
The [[Class (computer science)|classes]] in the Java APIs are organized into separate groups called [[Java package|packages]]. Each package contains a set of related [[Interface (Java)|interface]]s, classes and [[exception handling|exceptions]]. Refer to the separate platforms for a description of the packages available.
The set of APIs is controlled by Sun Microsystems in cooperation with others through the [[Java Community Process]] program. Companies or individuals participating in this process can influence the design and development of the APIs. This process has been a subject of controversy.
Language
A '''language''' is a dynamic set of visual, auditory, or tactile [[symbol]]s of [[communication]] and the elements used to manipulate them. ''Language'' can also refer to the use of such systems as a general [[phenomenon]]. Language is considered to be an exclusively human mode of communication; although other animals make use of quite sophisticated communicative systems, none of these are known to make use of all of the properties that linguists use to define language.
== Properties of language ==
A set of agreed-upon symbols is only one feature of language; all languages must define the structural relationships between these symbols in a system of [[grammar]]. Rules of grammar are what distinguish language from other forms of communication. They allow a finite set of symbols to be manipulated to create a potentially infinite number of grammatical utterances.
Another property of language is that its symbols are [[arbitrary]]. Any concept or grammatical rule can be mapped onto a symbol. Most languages make use of sound, but the combinations of sounds used do not have any ''inherent'' meaning – they are merely an agreed-upon convention to represent a certain thing by users of that language. For instance, there is nothing about the [[Spanish language|Spanish]] [[word]] ''{{lang|es|nada}}'' itself that forces Spanish speakers to convey the idea of "nothing". Another set of sounds (for example, the English word ''nothing'') could equally be used to represent the same concept, but all Spanish speakers have acquired or learned to correlate this meaning for this particular sound pattern. For [[Slovene language|Slovenian]], [[Croatian language|Croatian]], [[Serbian language|Serbian/Kosovan]] or [[Bosnian language|Bosnian]] speakers on the other hand, ''{{lang|hr|nada}}'' means something else; it means "hope".
==The study of language==
===Linguistics===
[[Linguistics]] is the [[science|scientific]] and [[philosophy|philosophical]] study of language, encompassing a number of sub-fields. At the core of [[theoretical linguistics]] are the study of language structure ([[grammar]]) and the study of meaning ([[semantics]]). The first of these encompasses [[morphology (linguistics)|morphology]] (the formation and composition of [[word]]s), [[syntax]] (the rules that determine how words combine into [[phrase]]s and [[Sentence (linguistics)|sentences]]) and [[phonology]] (the study of sound systems and abstract sound units). [[Phonetics]] is a related branch of linguistics concerned with the actual properties of speech sounds ([[phone]]s), non-speech sounds, and how they are produced and [[speech perception|perceived]]. [[Theoretical linguistics]] is mostly concerned with developing models of linguistic knowledge. The fields that are generally considered as the core of theoretical linguistics are [[syntax]], [[phonology]], [[Morphology (linguistics)|morphology]], and [[semantics]]. [[Applied linguistics]] attempts to put linguistic theories into practice through areas like [[translation]], [[Stylistics (linguistics)|stylistics]], [[literary criticism]] and [[Literary theory|theory]], [[discourse analysis]], [[speech therapy]], speech pathology and [[Second language acquisition|foreign language teaching]].
===History===
The historical record of [[linguistics]] begins in [[India]] with [[Pāṇini]], the [[5th century BCE]] grammarian who formulated 3,959 rules of [[Sanskrit language|Sanskrit]] [[morphology (linguistics)|morphology]], known as the ''{{IAST|[[Aṣṭādhyāyī]]}}'' (अष्टाध्यायी), and with [[Tolkāppiyar]], the [[3rd century BCE]] grammarian of the [[Tamil language|Tamil]] work [[Tolkāppiyam]]. Pāṇini's grammar is highly systematized and technical. Inherent in its analytic approach are the concepts of the [[phoneme]], the [[morpheme]], and the [[Root (linguistics)|root]]; Western linguists only recognized the phoneme some two millennia later. Tolkāppiyar's work is perhaps the first to describe [[articulatory phonetics]] for a language. Its classification of the alphabet into [[consonant]]s and [[vowel]]s, and its grouping of elements such as nouns and verbs into classes, were also breakthroughs at the time.
In the [[Middle East]], the [[Persian Empire|Persian]] linguist [[Sibawayh]] (سیبویه) made a detailed and professional description of [[Arabic language|Arabic]] in 760 CE in his monumental work, ''Al-kitab fi al-nahw'' (الكتاب في النحو, ''The Book on Grammar''), bringing many [[Linguistics|linguistic]] aspects of language to light. In his book, he distinguished [[phonetics]] from [[phonology]].
Later in the West, the success of [[science]], [[mathematics]], and other [[formal system]]s in the 20th century led many to attempt a formalization of the study of language as a "semantic code". This resulted in the [[academic discipline]] of [[linguistics]], the founding of which is attributed to [[Ferdinand de Saussure]]. In the 20th century, substantial contributions to the understanding of language came from Saussure, [[Hjelmslev]], [[Émile Benveniste]] and [[Roman Jakobson]], whose work is characterized as being highly [[systematic]].
== Human languages ==
Human languages are usually referred to as natural languages, and the science of studying them falls under the purview of [[linguistics]]. A common progression for natural languages is that they are considered to be first spoken, then written, and then an understanding and explanation of their grammar is attempted.
Languages live, die, move from place to place, and change with time. Any language that ceases to change or develop is categorized as a [[dead language]]. Conversely, any ''living language'', that is, one in a continuous state of change, is known as a [[modern language]].
Making a principled distinction between one language and another is usually impossible. For instance, there are a few [[dialect]]s of [[German language|German]] similar to some dialects of [[Dutch language|Dutch]]. The transition between languages within the same [[language family]] is sometimes gradual (see [[dialect continuum]]).
Some like to make parallels with [[biology]], where it is not possible to make a well-defined distinction between one species and the next. In either case, the ultimate difficulty may stem from the [[interaction]]s between languages and [[population]]s. (See [[Dialect]] or [[August Schleicher]] for a longer discussion.)
The concepts of [[Ausbausprache - Abstandsprache - Dachsprache|Ausbausprache, Abstandsprache and Dachsprache]] are used to make finer distinctions about the degrees of difference between languages or dialects.
==Artificial languages==
=== Constructed languages ===
Some individuals and groups have constructed their own artificial languages, for practical, experimental, personal, or ideological reasons. International auxiliary languages are generally constructed languages that strive to be easier to learn than natural languages; other constructed languages strive to be more logical ("loglangs") than natural languages; a prominent example of this is [[Lojban]].
Some writers, such as [[J. R. R. Tolkien]], have created fantasy languages, for literary, [[Artistic language|artistic]] or personal reasons. The fantasy language of the [[Klingon]] race has in recent years been developed by fans of the Star Trek series, including a vocabulary and grammar.
Constructed languages are not necessarily restricted to the properties shared by natural languages.
This part of ISO 639 also includes identifiers that denote constructed (or artificial) languages. In order to qualify for inclusion the language must have a literature and it must be designed for the purpose of human communication. Specifically excluded are reconstructed languages and computer programming languages.
===International auxiliary languages===
Some languages, most constructed, are meant specifically for communication between people of different nationalities or language groups as an easy-to-learn second language. Several of these languages have been constructed by individuals or groups. Natural, pre-existing languages may also be used in this way - their developers merely catalogued and standardized their vocabulary and identified their grammatical rules. These languages are called ''naturalistic.'' One such language, [[Latino Sine Flexione]], is a simplified form of Latin. Two others, [[Occidental language|Occidental]] and [[Novial]], were drawn from several Western languages.
To date, the most successful auxiliary language is [[Esperanto]], invented by the Polish ophthalmologist [[L. L. Zamenhof|Zamenhof]]. It has a relatively large community, roughly estimated at about 2 million speakers worldwide, and a large body of literature and songs, and it is the only known constructed language to have [[Native Esperanto speakers|native speakers]], such as the Hungarian-born American businessman [[George Soros]]. Other auxiliary languages with a relatively large number of speakers and literature are [[Interlingua]] and [[Ido]].
===Controlled languages===
Controlled natural languages are subsets of natural languages whose grammars and dictionaries have been restricted in order to reduce or eliminate both ambiguity and complexity. The purpose behind the development and implementation of a controlled natural language typically is to aid non-native speakers of a natural language in understanding it, or to ease computer processing of a natural language. An example of a widely used controlled natural language is [[Simplified English]], which was originally developed for [[aerospace]] industry maintenance manuals.
== Formal languages ==
[[Mathematics]] and [[computer science]] use artificial entities called formal languages (including [[programming language]]s and [[markup language]]s, and some that are more theoretical in nature). These often take the form of [[character string]]s, produced by a combination of [[formal grammar]] and semantics of arbitrary complexity.
=== Programming languages ===
A programming language is an extreme case of a formal language that can be used to control the behavior of a machine, particularly a computer, to perform specific tasks. Programming languages are defined using syntactic and semantic rules, to determine structure and meaning respectively.
Programming languages are used to facilitate communication about the task of organizing and manipulating information, and to express algorithms precisely. Some authors restrict the term "programming language" to those languages that can express all possible algorithms; sometimes the term "computer language" is used for artificial languages that are more limited.
== Animal communication ==
The term "[[animal language]]s" is often used for non-human languages. Linguists do not consider these to be "language", but describe them as [[animal communication]], because the interaction between animals in such communication is fundamentally different in its underlying principles from human language. Nevertheless, some scholars have tried to disprove this mainstream premise through experiments on training chimpanzees to talk. [[Karl von Frisch]] received the Nobel Prize in 1973 for his proof of the language and dialects of the bees.
In several publicized instances, non-human animals have been taught to understand certain features of human language. [[Chimpanzee]]s, [[gorilla]]s, and [[orangutan]]s have been taught hand signs based on [[American Sign Language]]. The [[African Grey Parrot]], which possesses the ability to mimic human speech with a high degree of accuracy, is suspected of having sufficient intelligence to comprehend some of the speech it mimics. Most species of [[parrot]], despite expert mimicry, are believed to have no linguistic comprehension at all.
While proponents of animal communication systems have debated levels of [[semantics]], these systems have not been found to have anything approaching human language [[syntax]].
Language model
A statistical '''language model''' assigns a [[probability]] to a sequence of ''m'' words by means of a [[probability distribution]].
Language modeling is used in many [[natural language processing]] applications such as [[speech recognition]], [[machine translation]], [[part-of-speech tagging]], [[parsing]] and [[information retrieval]].
In [[speech recognition]] and in [[data compression]], such a model tries to capture the properties of a language, and to predict the next word in a speech sequence.
When used in information retrieval, a language model is associated with a [[document]] in a collection. With query ''Q'' as input, retrieved documents are ranked based on the probability that the document's language model would generate the terms of the query, ''P''(''Q''|''M<sub>d</sub>'').
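A minimal sketch of this ranking score, assuming the simplest possible document model: an unsmoothed unigram model in which each query term is generated independently with probability equal to its relative frequency in the document. The class name QueryLikelihood and the method score are hypothetical, and practical systems smooth these estimates so that one missing term does not force the score to zero.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of query-likelihood scoring.
class QueryLikelihood {
    // Probability that an unsmoothed unigram model of the document generates
    // the query: the product over query terms of count(term) / document length.
    // Assumes the document is not empty.
    static double score(String[] query, String[] document) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : document) {
            counts.merge(w, 1, Integer::sum);
        }
        double p = 1.0;
        for (String q : query) {
            p *= (double) counts.getOrDefault(q, 0) / document.length;
        }
        return p;
    }

    public static void main(String[] args) {
        String[] doc = {"the", "red", "house", "red"};
        // (2/4) * (1/4) = 0.125
        System.out.println(score(new String[] {"red", "house"}, doc));
    }
}
```

Documents would then be ranked by this score in descending order for a given query.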
Estimating the probability of sequences is difficult because [[phrase]]s or [[Sentence (linguistics)|sentence]]s can be arbitrarily long, so many sequences are never observed in the [[corpora]] used for [[training]] the language model (the [[data sparseness problem]], which also invites [[overfitting]]). For that reason these models are often approximated using smoothed [[N-gram]] models.
== N-gram models ==
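In an ''n''-gram model, the probability <math>P(w_1,\ldots,w_m)</math> of observing the sentence <math>w_1,\ldots,w_m</math> is approximated as
:<math>P(w_1,\ldots,w_m) = \prod_{i=1}^{m} P(w_i \mid w_1,\ldots,w_{i-1}) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-(n-1)},\ldots,w_{i-1})</math>
Here, it is assumed that the probability of observing the ''i''th word ''w<sub>i</sub>'' in the context history of the preceding ''i''−1 words can be approximated by the probability of observing it in the shortened context history of the preceding ''n''−1 words (an (''n''−1)th-order [[Markov property]]).
The conditional probability can be calculated from ''n''-gram frequency counts:
:<math>P(w_i \mid w_{i-(n-1)},\ldots,w_{i-1}) = \frac{\mathrm{count}(w_{i-(n-1)},\ldots,w_{i-1},w_i)}{\mathrm{count}(w_{i-(n-1)},\ldots,w_{i-1})}</math>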
The words '''bigram''' and '''trigram''' language model denote n-gram language models with ''n=2'' and ''n=3'', respectively.
=== Example ===
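In a bigram (''n''=2) language model, the probability of the sentence ''I saw the red house'' is approximated as
:<math>P(\text{I}, \text{saw}, \text{the}, \text{red}, \text{house}) \approx P(\text{I})\,P(\text{saw} \mid \text{I})\,P(\text{the} \mid \text{saw})\,P(\text{red} \mid \text{the})\,P(\text{house} \mid \text{red})</math>
whereas in a trigram (''n''=3) language model, the approximation is
:<math>P(\text{I}, \text{saw}, \text{the}, \text{red}, \text{house}) \approx P(\text{I})\,P(\text{saw} \mid \text{I})\,P(\text{the} \mid \text{I}, \text{saw})\,P(\text{red} \mid \text{saw}, \text{the})\,P(\text{house} \mid \text{the}, \text{red})</math>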
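The bigram case can be sketched in code by estimating each conditional probability from frequency counts. The class BigramModel below is a hypothetical, unsmoothed illustration; as an assumption of this sketch, the first word of a sentence is conditioned on an artificial start marker (the string BOS) so that every word has a preceding context.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, unsmoothed bigram model built from raw counts.
class BigramModel {
    private static final String START = "BOS";   // assumed start-of-sentence marker
    private final Map<String, Integer> contextCounts = new HashMap<>();
    private final Map<String, Integer> bigramCounts = new HashMap<>();

    // Adds the bigrams of one sentence to the counts.
    void train(String[] words) {
        String prev = START;
        for (String w : words) {
            contextCounts.merge(prev, 1, Integer::sum);          // bigrams starting with prev
            bigramCounts.merge(prev + " " + w, 1, Integer::sum); // the bigram (prev, w)
            prev = w;
        }
    }

    // Estimate of P(w | prev) as count(prev, w) / count(prev);
    // returns 0 when prev was never seen as a context.
    double conditional(String prev, String w) {
        int context = contextCounts.getOrDefault(prev, 0);
        if (context == 0) {
            return 0.0;
        }
        return (double) bigramCounts.getOrDefault(prev + " " + w, 0) / context;
    }

    // Product of the bigram conditionals over the whole sentence.
    double sentenceProbability(String[] words) {
        double p = 1.0;
        String prev = START;
        for (String w : words) {
            p *= conditional(prev, w);
            prev = w;
        }
        return p;
    }

    public static void main(String[] args) {
        BigramModel model = new BigramModel();
        model.train("I saw the red house".split(" "));
        model.train("I saw the blue house".split(" "));
        // "the" is followed by "red" in one of two cases, so this prints 0.5.
        System.out.println(model.sentenceProbability("I saw the red house".split(" ")));
    }
}
```

A sentence containing any bigram that was not seen in training receives probability zero here, which is the sparseness problem that smoothing is meant to address.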
Latent semantic analysis
'''Latent semantic analysis (LSA)''' is a technique in [[natural language processing]], in particular in [[vectorial semantics]], of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms.
LSA was patented in [[1988]] ([http://patft.uspto.gov/netacgi/nph-Parser?patentnumber=4839853 US Patent 4,839,853]) by [[Scott Deerwester]], [[Susan Dumais]], [[George Furnas]], [[Richard Harshman]], [[Thomas Landauer]], [[Karen Lochbaum]] and [[Lynn Streeter]]. In the context of its application to [[information retrieval]], it is sometimes called '''latent semantic indexing (LSI)'''.
== Occurrence matrix ==
LSA can use a [[term-document matrix]] which describes the occurrences of terms in documents; it is a [[sparse matrix]] whose rows correspond to [[terminology|terms]] (typically [[stemming|stemmed]] words that appear in the documents) and whose columns correspond to documents. A typical example of the weighting of the elements of the matrix is [[tf-idf]] (term frequency–inverse document frequency): each element is proportional to the number of times the term appears in the document, with rare terms upweighted to reflect their relative importance.
This matrix is also common to standard semantic models, though it is not necessarily explicitly expressed as a matrix, since the mathematical properties of matrices are not always used.
LSA transforms the occurrence matrix into a relation between the terms and some ''concepts'', and a relation between those concepts and the documents. Thus the terms and documents are now indirectly related through the concepts.
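As an illustration of the occurrence matrix, the following sketch builds a small term-document matrix with tf-idf weights. The class and method names are hypothetical, and the weighting used (raw count multiplied by the natural logarithm of the number of documents divided by the number of documents containing the term) is only one common tf-idf variant, chosen here as an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper for building a tf-idf weighted term-document matrix.
class TermDocumentMatrix {
    // Sorted list of all distinct terms; row i of the matrix belongs to term i.
    static List<String> vocabulary(String[][] docs) {
        TreeSet<String> terms = new TreeSet<>();
        for (String[] doc : docs) {
            terms.addAll(Arrays.asList(doc));
        }
        return new ArrayList<>(terms);
    }

    // Rows are terms, columns are documents. The terms list is expected to be
    // the result of vocabulary(docs), so every word in docs has a row.
    static double[][] tfIdf(String[][] docs, List<String> terms) {
        int n = docs.length;
        double[][] x = new double[terms.size()][n];
        for (int j = 0; j < n; j++) {
            for (String w : docs[j]) {
                x[terms.indexOf(w)][j] += 1.0;   // raw term frequency
            }
        }
        for (double[] row : x) {
            int df = 0;                           // documents containing the term
            for (double count : row) {
                if (count > 0) df++;
            }
            if (df == 0) continue;                // term absent everywhere: row stays zero
            double idf = Math.log((double) n / df);
            for (int j = 0; j < n; j++) {
                row[j] *= idf;                    // rare terms get larger weights
            }
        }
        return x;
    }

    public static void main(String[] args) {
        String[][] docs = {
            {"car", "truck", "car"},
            {"truck", "flower"},
            {"flower", "flower", "car"}
        };
        List<String> terms = vocabulary(docs);
        double[][] x = tfIdf(docs, terms);
        for (int i = 0; i < terms.size(); i++) {
            System.out.println(terms.get(i) + " " + Arrays.toString(x[i]));
        }
    }
}
```

A dense array is used for brevity; since most entries of a realistic matrix are zero, a sparse representation would normally be preferred.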
== Applications ==
The new concept space typically can be used to:
* Compare the documents in the concept space ([[data clustering]], [[document classification]]).
* Find similar documents across languages, after analyzing a base set of translated documents ([[cross language retrieval]]).
* Find relations between terms ([[synonymy]] and [[polysemy]]).
* Given a query of terms, translate it into the concept space, and find matching documents ([[information retrieval]]).
Synonymy and polysemy are fundamental problems in [[natural language processing]]:
* Synonymy is the phenomenon where different words describe the same idea. Thus, a query in a search engine may fail to retrieve a relevant document that does not contain the words which appeared in the query. For example, a search for "doctors" may not return a document containing the word "physicians", even though the words have the same meaning.
* Polysemy is the phenomenon where the same word has multiple meanings. So a search may retrieve irrelevant documents containing the desired words in the wrong meaning. For example, a botanist and a computer scientist looking for the word "tree" probably desire different sets of documents.
== Rank lowering ==
After the construction of the occurrence matrix, LSA finds a low-[[rank (matrix theory)|rank]] approximation to the [[term-document matrix]]. There could be various reasons for these approximations:
* The original term-document matrix is presumed too large for the computing resources; in this case, the approximated low rank matrix is interpreted as an ''approximation'' (a "least and necessary evil").
* The original term-document matrix is presumed ''noisy'': for example, anecdotal instances of terms are to be eliminated. From this point of view, the approximated matrix is interpreted as a ''de-noisified matrix'' (a better matrix than the original).
* The original term-document matrix is presumed overly [[Sparse matrix|sparse]] relative to the "true" term-document matrix. That is, the original matrix lists only the words actually ''in'' each document, whereas we might be interested in all words ''related to'' each document--generally a much larger set due to [[synonymy]].
The consequence of the rank lowering is that some dimensions are combined and depend on more than one term:
:: {(car), (truck), (flower)} --> {(1.3452 * car + 0.2828 * truck), (flower)}
This mitigates synonymy, as the rank lowering is expected to merge the dimensions associated with terms that have similar meanings. It also mitigates polysemy, since components of polysemous words that point in the "right" direction are added to the components of words that share a similar meaning. Conversely, components that point in other directions tend to either simply cancel out, or, at worst, to be smaller than components in the directions corresponding to the intended sense.
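The merging of dimensions can be illustrated numerically. The following is a minimal sketch, assuming NumPy and a made-up term-document count matrix; the counts and the helper name <code>low_rank_approximation</code> are illustrative and not taken from any corpus or library. It truncates the singular value decomposition to obtain a lower-rank matrix.

```python
import numpy as np

# Hypothetical toy counts (rows: terms, columns: documents).
terms = ["car", "truck", "flower"]
X = np.array([
    [2.0, 1.0, 0.0, 0.0],  # car
    [1.0, 2.0, 0.0, 0.0],  # truck
    [0.0, 0.0, 3.0, 1.0],  # flower
])


def low_rank_approximation(matrix, k):
    """Return the rank-k approximation of `matrix` from its truncated SVD."""
    U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
    # Keep only the k largest singular values and their singular vectors.
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]


X2 = low_rank_approximation(X, 2)
for term, row in zip(terms, np.round(X2, 3)):
    print(term, row)

# Frobenius norm of the difference between the original and the approximation.
print("approximation error:", np.linalg.norm(X - X2))
```

In this toy matrix "car" and "truck" occur in the same documents, so dropping the smallest singular value leaves both rows with the same averaged counts (approximately 1.5 in the first two documents), while the "flower" row is unchanged.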
== Derivation ==
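Let <math>X</math> be a matrix where element <math>(i,j)</math> describes the occurrence of term <math>i</math> in document <math>j</math> (this can be, for example, the frequency). <math>X</math> will look like this:
:<math>X = \begin{bmatrix} x_{1,1} & \dots & x_{1,n} \\ \vdots & \ddots & \vdots \\ x_{m,1} & \dots & x_{m,n} \end{bmatrix}</math>
Now a row in this matrix will be a vector corresponding to a term, giving its relation to each document:
:<math>\textbf{t}_i^T = \begin{bmatrix} x_{i,1} & \dots & x_{i,n} \end{bmatrix}</math>
Likewise, a column in this matrix will be a vector corresponding to a document, giving its relation to each term:
:<math>\textbf{d}_j = \begin{bmatrix} x_{1,j} \\ \vdots \\ x_{m,j} \end{bmatrix}</math>
Now the [[dot product]] <math>\textbf{t}_i^T \textbf{t}_p</math> between two term vectors gives the [[correlation]] between the terms over the documents. The [[matrix product]] <math>X X^T</math> contains all these dot products. Element <math>(i,p)</math> (which is equal to element <math>(p,i)</math>) contains the dot product <math>\textbf{t}_i^T \textbf{t}_p</math> (<math>= \textbf{t}_p^T \textbf{t}_i</math>). Likewise, the matrix <math>X^T X</math> contains the dot products between all the document vectors, giving their correlation over the terms: <math>\textbf{d}_j^T \textbf{d}_q = \textbf{d}_q^T \textbf{d}_j</math>.
Now assume that there exists a decomposition of <math>X</math> such that <math>U</math> and <math>V</math> are [[orthonormal matrix|orthonormal matrices]] and <math>\Sigma</math> is a [[diagonal matrix]]. This is called a [[singular value decomposition]] (SVD):
:<math>X = U \Sigma V^T</math>
The matrix products giving us the term and document correlations then become
:<math>X X^T = (U \Sigma V^T)(U \Sigma V^T)^T = U \Sigma V^T V \Sigma^T U^T = U \Sigma \Sigma^T U^T</math>
:<math>X^T X = (U \Sigma V^T)^T (U \Sigma V^T) = V \Sigma^T U^T U \Sigma V^T = V \Sigma^T \Sigma V^T</math>
Since <math>\Sigma \Sigma^T</math> and <math>\Sigma^T \Sigma</math> are diagonal we see that <math>U</math> must contain the [[eigenvector]]s of <math>X X^T</math>, while <math>V</math> must contain the eigenvectors of <math>X^T X</math>. Both products have the same non-zero eigenvalues, given by the non-zero entries of <math>\Sigma \Sigma^T</math>, or equally, by the non-zero entries of <math>\Sigma^T \Sigma</math>. Now the decomposition looks like this:
:<math>X = \begin{bmatrix} \textbf{u}_1 & \dots & \textbf{u}_l \end{bmatrix} \begin{bmatrix} \sigma_1 & \dots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \dots & \sigma_l \end{bmatrix} \begin{bmatrix} \textbf{v}_1^T \\ \vdots \\ \textbf{v}_l^T \end{bmatrix}</math>
The values <math>\sigma_1, \dots, \sigma_l</math> are called the singular values, and <math>\textbf{u}_1, \dots, \textbf{u}_l</math> and <math>\textbf{v}_1, \dots, \textbf{v}_l</math> the left and right singular vectors.
Notice how the only part of <math>U</math> that contributes to <math>\textbf{t}_i</math> is the <math>i</math>-th row.
Let this row vector be called <math>\hat{\textbf{t}}_i^T</math>.
Likewise, the only part of <math>V^T</math> that contributes to <math>\textbf{d}_j</math> is the <math>j</math>-th column, <math>\hat{\textbf{d}}_j</math>.
These are ''not'' the eigenvectors, but ''depend'' on ''all'' the eigenvectors.
It turns out that when you select the <math>k</math> largest singular values, and their corresponding singular vectors from <math>U</math> and <math>V</math>, you get the rank <math>k</math> approximation to <math>X</math> with the smallest error ([[Frobenius norm]]). This approximation not only has minimal error, but also translates the term and document vectors into a concept space. The vector <math>\hat{\textbf{t}}_i</math> then has <math>k</math> entries, each giving the occurrence of term <math>i</math> in one of the <math>k</math> concepts. Likewise, the vector <math>\hat{\textbf{d}}_j</math> gives the relation between document <math>j</math> and each concept. We write this approximation as
:<math>X_k = U_k \Sigma_k V_k^T</math>
You can now do the following:
* See how related documents <math>j</math> and <math>q</math> are in the concept space by comparing the vectors <math>\hat{\textbf{d}}_j</math> and <math>\hat{\textbf{d}}_q</math> (typically by [[vector space model|cosine similarity]]). This gives you a clustering of the documents.
* Compare terms <math>i</math> and <math>p</math> by comparing the vectors <math>\hat{\textbf{t}}_i</math> and <math>\hat{\textbf{t}}_p</math>, giving you a clustering of the terms in the concept space.
* Given a query, view this as a mini document, and compare it to your documents in the concept space.
To do the latter, you must first translate your query into the concept space. It is then intuitive that you must use the same transformation that you use on your documents:
:<math>\textbf{d}_j = U_k \Sigma_k \hat{\textbf{d}}_j</math>
:<math>\hat{\textbf{d}}_j = \Sigma_k^{-1} U_k^T \textbf{d}_j</math>
This means that if you have a query vector <math>\textbf{q}</math>, you must do the translation <math>\hat{\textbf{q}} = \Sigma_k^{-1} U_k^T \textbf{q}</math> before you compare it with the document vectors in the concept space. You can do the same for pseudo term vectors:
:<math>\textbf{t}_i^T = \hat{\textbf{t}}_i^T \Sigma_k V_k^T</math>
:<math>\hat{\textbf{t}}_i^T = \textbf{t}_i^T V_k \Sigma_k^{-1}</math>
:<math>\hat{\textbf{t}}_i = \Sigma_k^{-1} V_k^T \textbf{t}_i</math>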
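Translating a query into the concept space can be sketched in a few lines. The example below assumes NumPy, the usual SVD notation in which the term-document matrix is approximated by the product of <code>U_k</code>, <code>S_k</code> and <code>Vt_k</code>, and a query mapped by multiplying it with the inverse of <code>S_k</code> and the transpose of <code>U_k</code>. The count matrix and the helper names <code>fold_in</code> and <code>cosine</code> are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical toy counts (rows: car, truck, flower; columns: four documents).
X = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 1.0],
])

k = 2  # number of concepts to keep
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_k, S_k, Vt_k = U[:, :k], np.diag(s[:k]), Vt[:k, :]


def fold_in(term_vector):
    """Map a vector over terms into the k-dimensional concept space."""
    return np.linalg.inv(S_k) @ U_k.T @ term_vector


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


doc_vectors = Vt_k  # column j is the concept-space vector of document j

query = np.array([0.0, 1.0, 0.0])  # a query containing only "truck"
query_hat = fold_in(query)
scores = [cosine(query_hat, doc_vectors[:, j]) for j in range(X.shape[1])]
print(np.round(scores, 3))
```

In this toy data the "truck" query is as similar to the first document, which mostly mentions "car", as it is to the second, and is unrelated to the two "flower" documents. Folding in a column of the original matrix reproduces the corresponding column of <code>Vt_k</code>, which is a useful sanity check on the transformation.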
== Implementation ==
The [[Singular Value Decomposition|SVD]] is typically computed using large matrix methods (for example, [[Lanczos method]]s) but may also be computed incrementally and with greatly reduced resources via a [[neural network]]-like approach, which does not require the large, full-rank matrix to be held in memory ([http://www.dcs.shef.ac.uk/~genevieve/gorrell_webb.pdf Gorrell and Webb, 2005]).
A fast, incremental, low-memory, large-matrix SVD algorithm has recently been developed ([http://www.merl.com/publications/TR2006-059/ Brand, 2006]). Unlike Gorrell and Webb's (2005) stochastic approximation, Brand's (2006) algorithm provides an exact solution.
== Limitations ==
LSA has two drawbacks:
* The resulting dimensions might be difficult to interpret. For instance, in
:: {(car), (truck), (flower)} --> {(1.3452 * car + 0.2828 * truck), (flower)}
:the (1.3452 * car + 0.2828 * truck) component could be interpreted as "vehicle". However, it is very likely that cases close to
:: {(car), (bottle), (flower)} --> {(1.3452 * car + 0.2828 * bottle), (flower)}
:will occur. This leads to results which can be justified on the mathematical level, but have no interpretable meaning in natural language.
* The [[probabilistic model]] of LSA does not match observed data: LSA assumes that words and documents form a joint [[normal distribution|Gaussian]] model ([[ergodic hypothesis]]), while a [[Poisson distribution]] has been observed. Thus, a newer alternative is [[probabilistic latent semantic analysis]], based on a [[multinomial distribution|multinomial]] model, which is reported to give better results than standard LSA.
Linguistics
'''Linguistics''' is the [[science|scientific]] study of [[language]], encompassing a number of sub-fields. An important topical division is between the study of language structure ([[grammar]]) and the study of [[Meaning (linguistics)|meaning]] ([[semantics]]). Grammar encompasses [[morphology (linguistics)|morphology]] (the formation and composition of [[word]]s), [[syntax]] (the rules that determine how words combine into [[phrase]]s and [[Sentence (linguistics)|sentences]]) and [[phonology]] (the study of sound systems and abstract sound units). [[Phonetics]] is a related branch of linguistics concerned with the actual properties of speech sounds ([[phone]]s), non-speech sounds, and how they are produced and [[speech perception|perceived]].
Over the twentieth century, following the work of [[Noam Chomsky]], linguistics came to be dominated by the [[Generative grammar|Generativist school]], which is chiefly concerned with explaining how human beings [[language acquisition|acquire language]] and the biological constraints on this acquisition; generative theory is [[Language module|modularist]] in character. While this remains the dominant paradigm, other linguistic theories have increasingly gained in popularity — [[cognitive linguistics]] being a prominent example. There are many sub-fields in linguistics, which may or may not be dominated by a particular theoretical approach: [[evolutionary linguistics]], for example, attempts to account for the origins of language; [[historical linguistics]] explores language change; and [[sociolinguistics]] looks at the relation between linguistic variation and social structures.
A variety of intellectual disciplines are relevant to the study of language. Although certain linguists have downplayed the relevance of some other fields, linguistics — like other sciences — is highly interdisciplinary and draws on work from such fields as [[psychology]], [[informatics]], [[computer science]], [[philosophy]], [[biology]], [[human anatomy]], [[neuroscience]], [[sociology]], [[anthropology]], and [[acoustics]].
==Names for the discipline==
Before the twentieth century (the word is first attested in 1716), the term "[[philology]]" was commonly used to refer to the science of language, which was then predominantly historical in focus. Since [[Ferdinand de Saussure]]'s insistence on the importance of [[Synchronic analysis (linguistics)|synchronic analysis]], however, this focus has shifted and the term "philology" is now generally used for the "study of a language's grammar, history and literary tradition", especially in the [[USA]], where it was never as popular as elsewhere in the sense "science of language". The term "linguistics" dates from 1847, although "linguist" in the sense "a student of language" dates from 1641. "Linguistics" is now the usual academic term in English for the scientific study of language.
==Fundamental concerns and divisions==
Linguistics concerns itself with describing and explaining the nature of human language. Relevant to this are the questions of what is universal to language, how language can vary, and how human beings come to know languages. All humans (setting aside extremely pathological cases) achieve competence in whatever language is spoken (or signed, in the case of [[sign language|signed languages]]) around them when growing up, with apparently little need for explicit conscious instruction. While non-humans acquire their own communication systems, they do not acquire human language in this way (although many non-human animals can learn to respond to language, or can even be trained to use it to a degree). Therefore, linguists assume, the ability to acquire and use language is an innate, biologically-based potential of modern human beings, similar to the ability to walk. There is no consensus, however, as to the extent of this innate potential, or its domain-specificity (the degree to which such innate abilities are specific to language), with some theorists claiming that there is a very large set of highly abstract and specific binary settings coded into the human brain, while others claim that the ability to learn language is a product of general human cognition. It is, however, generally agreed that there are no strong ''genetic'' differences underlying the differences between languages: an individual will acquire whatever language(s) they are exposed to as a child, regardless of parentage or ethnic origin.
Linguistic structures are pairings of meaning and form (which may consist of sound patterns, movements of the hand, written symbols, and so on); such pairings are known as [[Ferdinand de Saussure|Saussurean]] [[linguistic sign|signs]]. Linguists may specialize in some sub-area of linguistic structure, which can be arranged in the following terms, from form to meaning:
* '''[[Phonetics]]''', the study of the physical properties of speech (or signed) production and perception
* '''[[Phonology]]''', the study of sounds (adjusted appropriately for signed languages) as discrete, abstract elements in the speaker's mind that distinguish meaning
* '''[[Morphology (linguistics)|Morphology]]''', the study of internal structures of [[word]]s and how they can be modified
* '''[[Syntax]]''', the study of how words combine to form grammatical [[sentence]]s
* '''[[Semantics]]''', the study of the meaning of words ([[lexical semantics]]) and fixed word combinations ([[phraseology]]), and how these combine to form the [[meaning]]s of sentences
* '''[[Pragmatics]]''', the study of how [[utterance]]s are used (literally, figuratively, or otherwise) in [[speech acts|communicative acts]]
* '''[[Discourse analysis]]''', the analysis of language use in [[texts]] (spoken, written, or signed)
Many linguists would agree that these divisions overlap considerably, and the independent significance of each of these areas is not universally acknowledged. Regardless of any particular linguist's position, each area has core concepts that foster significant scholarly inquiry and research.
Intersecting with these domains are fields arranged around the kind of external factors that are considered. For example
* [[Linguistic typology]], the study of the common properties of diverse unrelated languages, properties that may, given sufficient attestation, be assumed to be innate to human language capacity.
* [[Stylistics (linguistics)|Stylistics]], the study of linguistic factors that place a discourse in context.
* [[Developmental linguistics]], the study of the development of linguistic ability in an individual, particularly [[Language acquisition|the acquisition of language]] in childhood.
* [[Historical linguistics]] or Diachronic linguistics, the study of language change.
* [[Language geography]], the study of the spatial patterns of languages.
* [[Evolutionary linguistics]], the study of the origin and subsequent development of language.
* [[Psycholinguistics]], the study of the cognitive processes and representations underlying language use.
* [[Sociolinguistics]], the study of social patterns and norms of linguistic variability.
* [[Clinical linguistics]], the application of linguistic theory to the area of [[Speech-Language Pathology]].
* [[Neurolinguistics]], the study of the brain networks that underlie grammar and communication.
* [[Biolinguistics]], the study of natural as well as human-taught communication systems in animals compared to human language.
* [[Computational linguistics]], the study of computational implementations of linguistic structures.
* [[Applied linguistics]], the study of language-related issues applied in everyday life, notably language policies, planning, and education. [[Constructed language]] fits under applied linguistics.
The related discipline of [[semiotics]] investigates the relationship between signs and what they signify. From the perspective of semiotics, language can be seen as a sign or symbol, with the world as its representation.
==Variation and universality==
Much modern linguistic research, particularly within the [[paradigm]] of [[generative grammar]], has concerned itself with trying to account for differences between languages of the world. This has worked on the assumption that if human linguistic ability is narrowly constrained by human biology, then all languages must share certain fundamental properties.
In [[generative grammar|generativist theory]], the collection of fundamental properties all languages share is referred to as [[universal grammar]] (UG). The specific characteristics of this universal grammar are a much debated topic. [[Linguistic typology|Typologists]] and non-generativist linguists usually refer simply to [[linguistic universal|language universals]], or ''universals of language''.
Similarities between languages can have a number of different origins. In the simplest case, universal properties may be due to universal aspects of human experience. For example, all humans experience water, and all human languages have a word for water. Other similarities may be due to common descent: the [[Latin language]] spoken by the [[Ancient Rome|Ancient Romans]] developed into Spanish in Spain and Italian in Italy; similarities between Spanish and Italian are thus in many cases due to both being descended from Latin. In other cases, [[Language contact|contact between languages]] — particularly where many speakers are bilingual — can lead to much borrowing of structures, as well as words. Similarity may also, of course, be due to coincidence. English ''much'' and Spanish ''mucho'' are not descended from the same form or borrowed from one language to the other; nor is the similarity due to innate linguistic knowledge (see [[False cognate]]).
Arguments in favor of language universals have also come from documented cases of [[sign language]]s (such as [[Al-Sayyid Bedouin Sign Language]]) developing in communities of congenitally deaf people, independently of spoken language. The properties of these sign languages conform generally to many of the properties of spoken languages. Other known and suspected sign language [[language isolate|isolates]] include [[Kata Kolok]], [[Nicaraguan Sign Language]], and [[Providence Island Sign Language]].
== Structures ==
It has been perceived that languages tend to be organized around [[grammatical categories]] such as noun and verb, [[nominative case|nominative]] and [[accusative case|accusative]], or present and past, though, importantly, not exclusively so. The grammar of a language is organized around such fundamental categories, though many languages express the relationships between words and syntax in other discrete ways (cf. some Bantu languages for noun/verb relations, ergative/absolutive systems for case relations, several Native American languages for tense/aspect relations).
In addition to making substantial use of discrete categories, language has the important property that it organizes elements into recursive structures; this allows, for example, a noun phrase to contain another noun phrase (as in “the chimpanzee’s lips”) or a clause to contain a clause (as in “I think that it’s raining”). Though recursion in grammar was implicitly recognized much earlier (for example by [[Otto Jespersen|Jespersen]]), the importance of this aspect of language became more widely recognized after the 1957 publication of [[Noam Chomsky]]’s book “[[Syntactic Structures]]”, which presented a formal grammar of a fragment of English. Prior to this, the most detailed descriptions of linguistic systems were of phonological or morphological systems.
Chomsky used a [[context-free grammar]] augmented with transformations. Since then, following the trend of Chomskyan linguistics, context-free grammars have been written for substantial fragments of various languages (for example [[Generalised phrase structure grammar|GPSG]], for English), but it has been demonstrated that human languages include cross-serial dependencies, which cannot be handled adequately by context-free grammars.
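Clause embedding of the kind described above can be sketched with a toy context-free grammar. The grammar, its vocabulary, and the function names <code>ends</code> and <code>is_sentence</code> below are invented for illustration. The sketch covers only a handful of sentences, has no transformations, and does not attempt the cross-serial dependencies mentioned above.

```python
# Toy context-free grammar (hypothetical). The rule for VP mentions S,
# so a clause can contain another clause to any depth.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["I"], ["it"]],
    "VP": [["think", "that", "S"], ["rains"]],
}


def ends(symbol, tokens, start):
    """Return every position where `symbol` can end if it begins at `start`."""
    if symbol not in GRAMMAR:  # terminal: must match the next token exactly
        if start < len(tokens) and tokens[start] == symbol:
            return {start + 1}
        return set()
    result = set()
    for production in GRAMMAR[symbol]:
        positions = {start}
        for part in production:
            # Extend every partial match by the next symbol of the production.
            positions = {e for p in positions for e in ends(part, tokens, p)}
        result |= positions
    return result


def is_sentence(text):
    """True if the whole token sequence can be derived from S."""
    tokens = text.split()
    return len(tokens) in ends("S", tokens, 0)


for example in ["it rains",
                "I think that it rains",
                "I think that I think that it rains",
                "I think that"]:
    print(repr(example), is_sentence(example))
```

Because the grammar has no left-recursive rules, this simple top-down search terminates; a grammar with a rule such as a noun phrase beginning with another noun phrase would need a different parsing strategy.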
==Some selected sub-fields ==
'''Diachronic linguistics'''
Studying languages at a particular point in time (usually the present) is "synchronic", while diachronic linguistics examines how language changes through time, sometimes over centuries. It enjoys both a rich history and a strong theoretical foundation for the study of [[language change]].
In universities in the United States, the historical perspective is often out of fashion. The shift in focus to a non-historic perspective started with [[Ferdinand de Saussure|Saussure]] and became predominant with [[Noam Chomsky]].
Explicitly historical perspectives include [[historical-comparative linguistics]] and [[etymology]].
'''Contextual linguistics'''
Contextual linguistics may include the study of linguistics in interaction with other academic disciplines. The interdisciplinary areas of linguistics consider how language interacts with the rest of the world. [[Sociolinguistics]], [[anthropological linguistics]], and [[linguistic anthropology]] are seen as areas that bridge the gap between linguistics and society as a whole. [[Psycholinguistics]] and [[neurolinguistics]] relate linguistics to the [[medical science]]s.
Other cross-disciplinary areas of linguistics include [[evolutionary linguistics]], [[computational linguistics]] and [[cognitive science]].
'''Applied linguistics'''
Linguists are largely concerned with finding and [[descriptive linguistics|describing]] the generalities and varieties both within particular languages and among all languages. [[Applied linguistics]] takes the results of those findings and “applies” them to other areas. Often “applied linguistics” refers to the use of linguistic research in language teaching, but results of linguistic research are used in many other areas as well.
Today, in the age of information technology, many areas of applied linguistics involve the use of computers. [[Speech synthesis]] and [[speech recognition]] use phonetic and phonemic knowledge to provide voice interfaces to computers. Applications of [[computational linguistics]] in [[machine translation]], [[computer-assisted translation]], and [[natural language processing]] are areas of applied linguistics which have come to the forefront. Their influence has had an effect on theories of syntax and semantics, as modeling syntactic and semantic theories on computers imposes constraints on those theories.
==Description and prescription==
''Main articles: [[Descriptive linguistics]], [[Linguistic prescription]]''
Linguistics is '''descriptive'''; linguists describe and explain features of language without making subjective judgments on whether a particular feature is "right" or "wrong". This is analogous to practice in other sciences: a [[zoologist]] studies the animal kingdom without making subjective judgments on whether a particular animal is better or worse than another.
'''Prescription''', on the other hand, is an attempt to promote particular linguistic usages over others, often favouring a particular dialect or "[[acrolect]]". This may have the aim of establishing a [[Standard language|linguistic standard]], which can aid communication over large geographical areas. It may also, however, be an attempt by speakers of one language or dialect to exert influence over speakers of other languages or dialects (see [[Linguistic imperialism]]). An extreme version of prescriptivism can be found among [[censorship|censors]], who attempt to eradicate words and structures which they consider to be destructive to society.
== Speech and writing ==
Most contemporary linguists work under the assumption that [[spoken language|spoken]] (or signed) language is more fundamental than [[written language]]. This is because:
* Speech appears to be a human "universal", whereas there have been many [[culture]]s and speech communities that lack written communication;
* Speech evolved before human beings discovered writing;
* People learn to speak and process spoken languages more easily and much earlier than writing.
Linguists nonetheless agree that the study of written language can be worthwhile and valuable. For research that relies on [[corpus linguistics]] and [[computational linguistics]], written language is often much more convenient for processing large amounts of linguistic data. Large corpora of spoken language are difficult to create and hard to find, and are typically [[transcription (linguistics)|transcribed]] and written. Additionally, linguists have turned to text-based discourse occurring in various formats of [[computer-mediated communication]] as a viable site for linguistic inquiry.
The study of [[writing systems]] themselves is in any case considered a branch of linguistics.
== History ==
Some of the earliest linguistic activities can be traced to [[Iron Age India]] with the analysis of [[Sanskrit]]. The [[Pratishakhya]]s (from ca. the 8th century BC) constitute, as it were, a proto-linguistic ''ad hoc'' collection of observations about mutations to a given [[corpus linguistics|corpus]] particular to a given [[Shakha|Vedic school]]. Systematic study of these texts gives rise to the [[Vedanga]] discipline of [[Vyakarana]], the earliest surviving account of which is the work of {{IAST|[[Pānini]]}} (c. 520 – 460 BC), who, however, looks back on what are probably several generations of grammarians, whose opinions he occasionally refers to. {{IAST|Pānini}} formulates close to 4,000 rules which together form a compact [[generative grammar]] of Sanskrit. Inherent in his analytic approach are the concepts of the [[phoneme]], the [[morpheme]] and the [[root]]. Due to its focus on brevity, his grammar has a highly unintuitive structure, reminiscent of contemporary "machine language" (as opposed to "human readable" programming languages).
Indian linguistics maintained a high level for several centuries; [[Mahābhāṣya|Patanjali]] in the 2nd century BC still actively criticizes Panini. In the later centuries BC, however, Panini's grammar came to be seen as prescriptive, and commentators came to be fully dependent on it. [[Bhartrihari]] (c. 450 – 510) theorized the act of speech as being made up of four stages: first, conceptualization of an idea; second, its verbalization and sequencing (articulation); third, delivery of speech into atmospheric air; and fourth, the interpretation of speech by the listener, the interpreter.
In the [[Middle East]], the [[Persian language|Persian]] linguist [[Sibawayh]] made a detailed and professional description of [[Arabic language|Arabic]] in 760, in his monumental work, ''Al-kitab fi al-nahw'' (الكتاب في النحو, ''The Book on Grammar''), bringing many linguistic aspects of language to light. In his book he distinguished [[phonetics]] from [[phonology]].
Western linguistics begins in Classical Antiquity with grammatical speculation such as [[Plato]]'s ''[[Cratylus]]''. [[William Jones (philologist)|Sir William Jones]] noted that [[Sanskrit]] shared many common features with classical [[Latin]] and [[Ancient Greek|Greek]], notably verb roots and grammatical structures, such as the [[case system]]. This led to the theory that all languages sprang from a common source and to the discovery of the [[Indo-European]] [[language family]]. He began the study of [[comparative linguistics]], which would uncover more language families and branches.
Notable 19th-century linguists included [[Jakob Grimm]], who devised a principle of consonantal shifts in pronunciation – known as [[Grimm's Law]] – in 1822; [[Karl Verner]], who formulated [[Verner's Law]]; [[August Schleicher]], who created the "Stammbaumtheorie" ("family tree"); and [[Johannes Schmidt (linguist)|Johannes Schmidt]], who developed the "Wellentheorie" ("wave model") in 1872. [[Ferdinand de Saussure]] was the founder of modern structural linguistics. [[Edward Sapir]], a leader in American structural linguistics, was one of the first to explore the relations between language studies and anthropology. His methodology had a strong influence on all his successors. [[Noam Chomsky|Noam Chomsky's]] formal model of language, [[transformational-generative grammar]], developed under the influence of his teacher [[Zellig Harris]], who was in turn strongly influenced by [[Leonard Bloomfield]], has been the dominant model since the 1960s.
[[Noam Chomsky]] remains a pop-linguistic figure. Linguists (working in frameworks such as [[Head-Driven Phrase Structure Grammar]] (HPSG) or [[Lexical Functional Grammar]] (LFG)) are increasingly seen to stress the importance of formalization and formal rigor in linguistic description, and may distance themselves somewhat from Chomsky's more recent work (the "Minimalist" program for [[Transformational grammar]]), connecting more closely to his earlier works.
Other linguists working in [[Optimality Theory]] state generalizations in terms of violable constraints that interact with each other, and abandon the traditional rule-based formalism first pioneered by early work in generativist linguistics. Functionalist linguists working in [[functional grammar]] and [[Cognitive Linguistics]] tend to stress the non-autonomy of linguistic knowledge and the non-universality of linguistic structures, thus differing significantly from the Chomskyan school. They reject Chomskyan intuitive introspection as a scientific method, relying instead on typological evidence.
Linux
'''Linux''' (commonly pronounced {{IPAEng|ˈlɪnəks}} in English; variants exist) is a [[Unix-like]] computer [[operating system]]. Linux is one of the most prominent examples of [[free software]] and [[open source]] development: typically all underlying [[source code]] can be freely modified, used, and redistributed by anyone.
The name "Linux" comes from the [[Linux kernel]], originally written in 1991 by [[Linus Torvalds]]. The system's [[system utility|utilities]] and [[library (computer science)|libraries]] usually come from the [[GNU operating system]], announced in 1983 by [[Richard Stallman]]. The GNU contribution is the basis for the alternative name '''GNU/Linux'''.
Predominantly known for its use in [[server (computing)|server]]s, Linux is supported by corporations such as [[Dell]], [[Hewlett-Packard]], [[IBM]], [[Novell]], [[Oracle Corporation]], [[Red Hat]], and [[Sun Microsystems]]. It is used as an operating system for a wide variety of computer [[hardware]], including [[desktop computer]]s, [[supercomputers]], video game systems, such as the [[PlayStation 2]] and [[PlayStation 3]], several [[arcade games]], and [[embedded devices]] such as [[mobile phone]]s, [[routers]], and [[stage lighting]] systems.
== History ==
The [[Unix]] operating system was conceived and implemented in the 1960s and first released in 1970. Its wide availability and [[Porting|portability]] meant that it was widely adopted, copied and modified by academic institutions and businesses, with its design being influential on authors of other systems.
The [[GNU Project]], started in 1984, had the goal of creating a "''complete Unix-compatible software system''" made entirely of [[free software]]. In 1985, [[Richard Stallman]] created the [[Free Software Foundation]] and developed the [[GNU General Public License]] (GNU GPL). Many of the programs required in an OS (such as libraries, [[compiler]]s, [[text editor]]s, a [[Unix shell]], and a windowing system) were completed by the early 1990s, although low level elements such as [[device driver]]s, [[daemon (computer software)|daemon]]s, and the [[Kernel (computer science)|kernel]] were stalled and incomplete. Linus Torvalds has said that if the GNU kernel had been available at the time (1991), he would not have decided to write his own.
=== MINIX ===
[[MINIX]], a Unix-like system intended for academic use, was released by [[Andrew S. Tanenbaum]] in 1987. While source code for the system was available, modification and redistribution were restricted (that is not the case today). In addition, MINIX's [[16-bit]] design was not well adapted to the [[32-bit]] design of the increasingly cheap and popular [[Intel 386]] architecture for personal computers.
In 1991, Torvalds began to work on a non-commercial replacement for MINIX while he was attending the [[University of Helsinki]]. This eventually became the [[Linux kernel]].
In 1992, Tanenbaum posted an article on [[Usenet]] claiming Linux was obsolete. In the article, he criticized the operating system as being [[Monolithic kernel|monolithic]] in design and being tied closely to the x86 architecture and thus not portable, which he described as "a fundamental error". Tanenbaum suggested that those who wanted a modern operating system should look into one based on the [[microkernel]] model. The posting elicited responses from Torvalds and [[Ken Thompson]], one of the founders of [[Unix]], which resulted in a well-known debate over the microkernel and monolithic kernel designs.
Linux was dependent on the MINIX [[user space]] at first. With code from the GNU system freely available, it was advantageous if this could be used with the fledgling OS. Code licensed under the GNU GPL can be used in other projects, so long as they also are released under the same or a compatible license. In order to make the Linux kernel compatible with the components from the GNU Project, Torvalds initiated a switch from his original license (which prohibited commercial redistribution) to the GNU GPL. Linux and GNU developers worked to integrate GNU components with Linux to make a fully functional and free operating system.
=== Commercial and popular uptake ===
Today Linux is used in numerous domains, from [[embedded system]]s to [[supercomputer]]s, and has secured a place in [[server (computing)|server]] installations with the popular [[LAMP (software bundle)|LAMP]] application stack. Torvalds continues to direct the development of the kernel. Stallman heads the Free Software Foundation, which in turn supports the GNU components. Finally, individuals and corporations develop third-party non-GNU components. These third-party components comprise a vast body of work and may include both kernel modules and user applications and libraries. Linux vendors and communities combine and distribute the kernel, GNU components, and non-GNU components, with additional package management software in the form of [[Linux distribution]]s.
== Design ==
Linux is a modular [[Unix-like]] operating system. It derives much of its basic design from principles established in Unix during the 1970s and 1980s. Linux uses a [[monolithic kernel]], the [[Linux kernel]], which handles process control, networking, and [[peripheral]] and [[file system]] access. [[Device drivers]] are integrated directly with the kernel.
Much of Linux's higher-level functionality is provided by separate projects which interface with the kernel. The GNU [[Userland (computing)|userland]] is an important part of most Linux systems, providing the [[shell (computing)|shell]] and [[Unix tool]]s which carry out many basic operating system tasks. On top of these tools, a Linux system can provide a [[graphical user interface]], usually running in the [[X Window System]].
=== User interface ===
Linux can be controlled through one or more of the following: a text-based [[command line interface]] (CLI), a [[graphical user interface]] (GUI), which is usually the default on desktops, or controls on the device itself, which are common on embedded machines.
On desktop machines, [[KDE]], [[GNOME]] and [[Xfce]] are the most popular user interfaces, though a variety of other user interfaces exist. Most popular user interfaces run on top of the [[X Window System]] (X), which provides [[network transparency]], enabling a graphical application running on one machine to be displayed and controlled from another.
Other GUIs include [[X window manager]]s such as [[FVWM]], [[Enlightenment (window manager)|Enlightenment]] and [[Window Maker]]. The window manager provides a means to control the placement and appearance of individual application windows, and interacts with the X window system.
A Linux system usually provides a [[CLI]] of some sort through a [[Shell (computing)|shell]], which is the traditional way of interacting with a Unix system. A Linux distribution specialized for servers may use the CLI as its only interface. A “headless system” run without even a monitor can be controlled by the command line via a protocol such as [[Secure Shell|SSH]] or [[telnet]].
Most low-level Linux components, including the GNU [[Userland (computing)|Userland]], use the CLI exclusively. The CLI is particularly suited for automation of repetitive or delayed tasks, and provides very simple [[inter-process communication]]. A graphical [[terminal emulator]] program is often used to access the CLI from a Linux desktop.
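The simple inter-process communication mentioned above is typically a pipe, in which the output of one program becomes the input of the next. The following is a minimal sketch in Python using only the standard library; the two child programs are hypothetical stand-ins for ordinary command-line tools, not part of any particular Linux distribution.

```python
import subprocess
import sys

# Two tiny stand-in "tools", each run as a separate process. In a shell these
# would be ordinary commands joined with "|".
PRODUCER = "print('pear'); print('apple'); print('fig')"
SORTER = "import sys; sys.stdout.writelines(sorted(sys.stdin))"


def run_pipeline():
    """Connect the producer's stdout to the sorter's stdin through a pipe."""
    producer = subprocess.Popen(
        [sys.executable, "-c", PRODUCER], stdout=subprocess.PIPE
    )
    sorter = subprocess.Popen(
        [sys.executable, "-c", SORTER],
        stdin=producer.stdout,
        stdout=subprocess.PIPE,
        text=True,
    )
    # Close our copy of the read end so the producer is signalled
    # if the sorter exits early.
    producer.stdout.close()
    output, _ = sorter.communicate()
    producer.wait()
    return output.split()


if __name__ == "__main__":
    print(run_pipeline())
```

Neither child program knows about the other; each only reads standard input or writes standard output, which is what makes this style of composition easy to automate.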
== Development ==
The primary difference between Linux and many other popular contemporary operating systems is that the [[Linux kernel]] and other components are [[free software|free]] and [[open source software]]. Linux is not the only such operating system, although it is the best-known and most widely used. Some [[free software license|free]] and [[open source license|open source]] software licences are based on the principle of [[copyleft]], a kind of reciprocity: any work derived from a copyleft piece of software must also be copyleft itself. The most common free software license, the [[GNU GPL]], is a form of copyleft, and is used for the Linux kernel and many of the components from the [[GNU project]].
As an operating system [[underdog (competition)|underdog]] competing with mainstream operating systems, Linux cannot rely on a [[monopoly]] advantage; in order for Linux to be convenient for users, Linux aims for [[interoperability]] with other operating systems and established computing standards. Linux systems adhere to [[POSIX]], [[Single UNIX Specification|SUS]], [[International Organization for Standardization|ISO]] and [[American National Standards Institute|ANSI]] standards where possible, although to date only one Linux distribution has been POSIX.1 certified, Linux-FT.
Free software projects, although developed in a [[Collaboration|collaborative]] fashion, are often produced independently of each other. However, given that the software licenses explicitly permit redistribution, this provides a basis for larger scale projects that collect the software produced by stand-alone projects and make it available all at once in the form of a [[Linux distribution]].
A [[Linux distribution]], commonly called a “distro”, is a project that manages a remote collection of Linux-based software, and facilitates installation of a Linux operating system. Distributions are maintained by individuals, loose-knit teams, volunteer organizations, and commercial entities. They include system software and [[application software]] in the form of ''packages'', and distribution-specific software for initial system installation and configuration as well as later package upgrades and installs. A distribution is responsible for the default configuration of installed Linux systems, system security, and more generally integration of the different software packages into a coherent whole.
=== Community ===
Linux is largely driven by its developer and user communities. Some vendors develop and fund their distributions on a volunteer basis, [[Debian]] being a well-known example. Others maintain a community version of their commercial distributions, as [[Red Hat]] does with [[Fedora (Linux distribution)|Fedora]].
In many cities and regions, local associations known as [[Linux Users Group]]s (LUGs) seek to promote Linux and by extension free software. They hold meetings and provide free demonstrations, training, technical support, and operating system installation to new users. There are also many [[Internet]] communities that seek to provide support to Linux users and developers. Most distributions and open source projects have [[IRC]] chatrooms or [[newsgroup]]s. [[Online forum]]s are another means for support, with notable examples being [[LinuxQuestions.org]] and the [[Gentoo Linux|Gentoo]] forums. Linux distributions host [[mailing list]]s; commonly there will be a specific topic such as usage or development for a given list.
There are several technology websites with a Linux focus. [[Linux Weekly News]] is a weekly digest of Linux-related news; the [[Linux Journal]] is an online magazine of Linux articles published monthly; [[Slashdot]] is a technology-related news website with many stories on Linux and open source software; [[Groklaw]] has written in depth about Linux-related legal proceedings and there are many articles relevant to the Linux kernel and its relationship with [[GNU]] on the [[GNU Project|GNU project's]] website. Print [[magazine]]s on Linux often include [[cover disk]]s including software or even complete Linux distributions.
Although Linux is generally available free of charge, several large corporations have established business models that involve selling, supporting, and contributing to Linux and free software. These include [[Dell]], [[IBM]], [[Hewlett-Packard|HP]], [[Sun Microsystems]], [[Novell]], and [[Red Hat]]. The free software licenses on which Linux is based explicitly accommodate and encourage commercialization; the relationship between Linux as a whole and individual vendors may be seen as [[symbiosis|symbiotic]]. One common business model of commercial suppliers is charging for support, especially for business users. A number of companies also offer a specialized business version of their distribution, which adds proprietary support packages and tools to administer higher numbers of installations or to simplify administrative tasks. Another business model is to give away the software in order to sell hardware.
=== Programming on Linux ===
Most Linux distributions support dozens of [[programming language]]s. The most common collection of utilities for building both Linux applications and operating system programs is found within the [[GNU toolchain]], which includes the [[GNU Compiler Collection]] (GCC) and the [[GNU build system]]. Amongst others, GCC provides compilers for [[Ada (programming language)|Ada]], [[C (programming language)|C]], [[C++]], [[Java (programming language)|Java]], and [[Fortran]]. The Linux kernel itself is written to be compiled with GCC. [[Proprietary software|Proprietary]] compilers for Linux include the [[Intel C++ Compiler]] and IBM XL C/C++ Compiler.
Most distributions also include support for [[Perl]], [[Ruby programming language|Ruby]], [[Python programming language|Python]] and other [[Dynamic programming language|dynamic languages]]. Examples of languages that are less common, but still well-supported, are [[C Sharp (programming language)|C#]] via the [[Mono (software)|Mono]] project, sponsored by [[Novell]], and [[Scheme programming language|Scheme]]. A number of [[Java Virtual Machine]]s and development kits run on Linux, including the original Sun Microsystems JVM ([[HotSpot]]), and IBM's J2SE RE, as well as many open-source projects like [[Kaffe]]. The two main frameworks for developing graphical applications are those of [[GNOME]] and [[KDE]]. These projects are based on the [[GTK+]] and [[Qt (toolkit)|Qt]] [[widget toolkit]]s, respectively, which can also be used independently of the larger framework. Both support a wide variety of languages. There are a number of [[Integrated development environment]]s available including [[Anjuta]], [[Code::Blocks]], [[Eclipse (computing)|Eclipse]], [[KDevelop]], [[Lazarus (software)|Lazarus]], [[MonoDevelop]], [[NetBeans]], and [[Omnis Studio]] while the long-established editors [[Vim (text editor)|Vim]] and [[Emacs]] remain popular.
== Uses ==
As well as those designed for general purpose use on desktops and servers, distributions may be specialized for different purposes including: [[computer architecture]] support, [[Embedded Linux|embedded systems]], stability, security, localization to a specific region or language, targeting of specific user groups, support for [[real-time computing|real-time]] applications, or commitment to a given desktop environment. Furthermore, some distributions deliberately include only [[free software]]. Currently, over three hundred distributions are actively developed, with about a dozen distributions being most popular for general-purpose use.
Linux is a widely [[porting|ported]] operating system. While the Linux kernel was originally designed only for [[Intel 80386]] [[microprocessor]]s, it now runs on a more diverse range of [[computer architecture]]s than any other operating system: in the hand-held [[ARM architecture|ARM]]-based [[iPAQ]] and the [[mainframe computer|mainframe]] [[IBM]] [[System z9]], in devices ranging from [[mobile phone]]s to [[supercomputer]]s. Specialized distributions exist for less mainstream architectures. The [[ELKS]] kernel [[fork (software development)|fork]] can run on [[Intel 8086]] or [[Intel 80286]] [[16-bit]] microprocessors, while the [[µClinux]] kernel fork may run on systems without a [[memory management unit]]. The kernel also runs on architectures that were only ever intended to use a manufacturer-created operating system, such as [[Macintosh]] computers, [[Personal digital assistant|PDA]]s, [[video game console]]s, [[Digital audio player|portable music players]], and [[mobile phone]]s.
=== Desktop ===
Although there is a lack of Linux ports for some [[Mac OS X]] and [[Microsoft Windows]] programs in domains such as [[desktop publishing]] and [[professional audio]], applications equivalent to those available for Mac and Windows are available for Linux.
Most Linux distributions provide a program for browsing a list of thousands of [[free software]] applications that have already been tested and configured for a specific distribution. These free programs can be downloaded and installed with one mouse click, and a digital signature lets the system verify that no one has added a virus or spyware to them.
Many [[free software]] titles that are popular on Windows, such as [[Pidgin (software)|Pidgin]], [[Mozilla Firefox]], [[Openoffice.org|OpenOffice.org]], and [[GIMP]], are available for Linux. A growing amount of proprietary desktop software is also supported under Linux, examples being [[Adobe Flash Player]], [[Adobe Acrobat|Acrobat Reader]], [[Matlab]], [[Nero Burning ROM]], [[Opera (Internet suite)|Opera]], [[RealPlayer]], and [[Skype]]. In the field of animation and visual effects, most high-end software, such as Autodesk Maya, Softimage XSI and Apple Shake, is available for Linux, Windows and/or Mac OS X. [[CrossOver]] is a proprietary solution based on the open source [[Wine (software)|Wine]] project that supports running older Windows versions of [[Microsoft Office]] and [[Adobe Photoshop]] versions through CS2. [[Microsoft Office 2007]] and Adobe Photoshop CS3 are known not to work.
Besides the free Windows compatibility layer [[Wine (software)|Wine]], most distributions offer [[Dual boot]] and [[X86 virtualization]] for running both Linux and Windows on the same computer.
Linux's open nature allows distributed teams to [[L10n|localize]] Linux distributions for use in locales where localizing proprietary systems would not be cost-effective. For example the [[Sinhalese language]] version of the [[Knoppix]] distribution was available for a long time before [[Microsoft Windows XP]] was translated to Sinhalese. In this case the Lanka Linux User Group played a major part in developing the localized system by combining the knowledge of university professors, [[linguist]]s, and local developers.
The performance of Linux on the desktop has been a controversial topic, with at least one key Linux kernel developer, Con Kolivas, accusing the Linux community of favouring performance on servers. He quit Linux development because he was frustrated with this lack of focus on the desktop, and then gave a 'tell all' interview on the topic.
=== Servers and supercomputers ===
Historically, Linux has mainly been used as a [[Server (computing)|server]] operating system, and has risen to prominence in that area; [[Netcraft]] reported in September 2006 that eight of the ten most reliable internet hosting companies run Linux on their [[web server]]s. This is due to its relative stability and long uptime, and the fact that desktop software with a graphical user interface for servers is often unneeded. Enterprise and non-enterprise Linux distributions may be found running on servers. Linux is the cornerstone of the [[LAMP (software bundle)|LAMP]] server-software combination (Linux, [[Apache HTTP Server|Apache]], [[MySQL]], [[Perl]]/[[PHP]]/[[Python (programming language)|Python]]) which has achieved popularity among developers, and which is one of the more common platforms for website hosting.
Linux is commonly used as an operating system for [[supercomputer]]s. As of [[November 2007]], out of the top 500 systems, 426 (85.2%) run Linux.
=== Embedded devices ===
Because of its low cost and ease of modification, [[embedded Linux]] is often used in [[embedded systems]]. Linux has become a major competitor to the proprietary [[Symbian OS]] found in the majority of smartphones (16.7% of [[smartphone]]s sold worldwide during 2006 were using Linux), and it is an alternative to the proprietary [[Windows CE]] and [[Palm OS]] operating systems on [[mobile device]]s. Cell phones and PDAs running Linux and built on open source platforms became a trend from 2007; examples include the [[Nokia N810]], [[Openmoko]]'s [[Neo1973]] and the ongoing [[Google Android]] project. The popular [[TiVo]] digital video recorder uses a customized version of Linux. Several standalone network [[firewall]] and [[router]] products, including several from [[Linksys]], run Linux internally and rely on its advanced firewall and routing capabilities. The [[Korg OASYS]] and the [[Yamaha Motif|Yamaha Motif XS]] [[music workstation]]s also run Linux. Furthermore, Linux is used in a leading [[stage lighting]] control system, the FlyingPig/HighEnd WholeHogIII console.
=== Market share and uptake ===
Many quantitative studies of open source software focus on topics including market share and reliability, with numerous studies specifically examining Linux. The Linux market is growing rapidly, and the revenue of servers, desktops, and packaged software running Linux is expected to exceed $35.7 billion by 2008. [[International Data Corporation|IDC]]'s report for Q1 2007 says that Linux now holds 12.7% of the overall server market. This estimate was based on the number of Linux servers sold by various companies.
Desktop adoption of Linux is approximately 1%. In comparison, [[List of Microsoft operating systems|Microsoft operating systems]] hold more than 90%.
The frictional cost of switching operating systems and lack of support for certain hardware and application programs designed for [[Microsoft Windows]] have been two factors that have inhibited adoption. Proponents and analysts attribute the relative success of Linux to its security, reliability, low cost, and freedom from [[vendor lock-in]].
More recently, Google has begun to fund [[Wine (software)|Wine]], a compatibility layer that allows users to run some Windows programs under Linux.
The [[OLPC XO-1|XO laptop]] project of One Laptop Per Child is creating a new and potentially much larger Linux community, planned to reach [http://www.laptop.org/en/vision/mission/index.shtml several hundred million schoolchildren] and their families and communities in developing countries. [http://wiki.laptop.org/go/countries Six countries] have ordered a million or more units each for delivery in 2007 to distribute to schoolchildren at no charge. [[Google]], [[Red Hat]], and [[eBay]] are major supporters of the project.
== Copyright and naming ==
The Linux kernel and most GNU software are [[software license|license]]d under the [[GNU General Public License]] (GPL). The GPL requires that anyone who distributes the Linux kernel must make the source code (and any modifications) available to the recipient under the same terms. In 1997, Linus Torvalds stated, “Making Linux GPL'd was definitely the best thing I ever did.” Other key components of a Linux system may use other licenses; many libraries use the [[GNU Lesser General Public License]] (LGPL), a more permissive variant of the GPL, and the [[X Window System]] uses the [[MIT License]].
Torvalds has publicly stated that he would not move the Linux kernel (currently licensed under GPL version 2) to version 3 of the GPL, released in mid-2007, specifically citing some provisions in the new license which prohibit the use of the software in [[digital rights management]].
A 2001 study of [[Red Hat Linux]] 7.1 found that this distribution contained 30 million [[source lines of code]]. Using the [[COCOMO|Constructive Cost Model]], the study estimated that this distribution required about eight thousand man-years of development time. According to the study, if all this software had been developed by conventional [[proprietary software|proprietary]] means, it would have cost about 1.08 billion dollars (year 2000 U.S. dollars) to develop in the United States.
Most of the code (71%) was written in the [[C (programming language)|C]] [[computer programming|programming]] [[programming language|language]], but many other languages were used, including [[C++]], [[assembly language]], [[Perl]], [[Python (programming language)|Python]], [[Fortran]], and various [[shell script]]ing languages. Slightly over half of all lines of code were licensed under the GPL. The Linux kernel itself was 2.4 million lines of code, or 8% of the total.
In a later study, the same analysis was performed for Debian GNU/Linux version 4.0. This distribution contained over 283 million source lines of code, and the study estimated that it would have cost 5.4 billion Euros to develop by conventional means.
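For orientation, the basic form of the Constructive Cost Model estimates effort from code size with a power law. The sketch below uses the textbook "organic mode" coefficients of basic COCOMO (2.4 and 1.05); the choice of coefficients and the application to a whole code base in a single step are assumptions made for illustration, so the output is not expected to reproduce the figures reported by the studies.

```python
def basic_cocomo_person_months(sloc, a=2.4, b=1.05):
    """Basic COCOMO effort estimate: a * (KSLOC ** b) person-months.

    The defaults are the textbook "organic mode" coefficients; they are an
    assumption here, not values taken from the studies discussed above.
    """
    ksloc = sloc / 1000.0
    return a * ksloc ** b


def person_years(sloc):
    """Convert the person-month estimate to person-years."""
    return basic_cocomo_person_months(sloc) / 12.0


if __name__ == "__main__":
    # 30 million lines, treated (unrealistically) as one single project.
    print(round(person_years(30_000_000)))
```

Because the exponent is greater than one, the estimate grows faster than the line count: one very large project is estimated to cost more than the same code split into several smaller projects whose estimates are added together.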
In the United States, the name ''Linux'' is a [[trademark]] registered to Linus Torvalds. Initially, nobody registered it, but on [[August 15]] [[1994]], William R. Della Croce, Jr. filed for the trademark ''Linux'', and then demanded royalties from Linux distributors. In 1996, Torvalds and some affected organizations sued him to have the trademark assigned to Torvalds, and in 1997 the case was settled. The licensing of the trademark has since been handled by the [[Linux Mark Institute]]. Torvalds has stated that he only trademarked the name to prevent someone else from using it, but was bound in 2005 by [[United States trademark law]] to take active measures to enforce the trademark. As a result, the LMI sent out a number of letters to distribution vendors requesting that a fee be paid for the use of the name, and a number of companies have complied.
=== GNU/Linux ===
The [[Free Software Foundation]] views Linux distributions which use GNU software as [[GNU variants]] and asks that such operating systems be referred to as ''GNU/Linux'' or ''a Linux-based GNU system''. However, the media and the population at large refer to this family of operating systems simply as ''Linux''. While some distributors make a point of using the aggregate form, most notably [[Debian]] with the ''[[Debian GNU/Linux]]'' distribution, the term's use outside of the enthusiast community is limited. The distinction between the Linux kernel and distributions based on it plus the GNU system is a source of confusion to many newcomers, and the naming remains controversial, as many large Linux distributions (e.g. [[Ubuntu]] and [[SuSE]] Linux) simply use the ''Linux'' name rather than ''GNU/Linux''.
List of chatterbots
==Chatterbot Directories==
*[http://www.simonlaven.com Chatterbot Central] at [http://www.simonlaven.com The Simon Laven Page]
*[http://www.aidreams.co.uk/chatterbotcollection/index.htm The Chatterbot Collection]
*[http://www.aihub.org AI Hub] - A directory of news, programs, and links all related to chatterbots and Artificial Intelligence
*[http://www.chatterboxchallenge.com/bots_dir.php The Chatterbox Challenge Bots Directory] at [http://www.chatterboxchallenge.com The Chatterbox Challenge]
==Classic Chatterbots==
*[[Dr. Sbaitso]]
*[[ELIZA]]
*[[PARRY]]
*[[Racter]]
==General Chatterbots==
*[[Artificial Linguistic Internet Computer Entity|A.L.I.C.E.]] and other [[Alicebot]]/pandorabot-based ([http://www.titane.ca/concordia/dfar251/igod/main.html iGod], [http://www.mousebreaker.com/games/chatbot/play.php Mitsuku], [http://www.friendbot.co.uk FriendBot], etc.)
*[[Albert One]]
*[[ALIMbot]]
*[[CHAT and TIPS]]
*[http://www.chat-bot.com Chat-bot]
*[[Claude Chatterbot|Claude]]
*[http://www.dadorac.com Dadorac]
*[http://www.dai2.co.uk/ DAI2] - A dynamic artificial intelligence which learns from its surrounding community
*[http://www.elbot.com/ Elbot]
*[[Ella Chatterbot|Ella]]
*[[Fred Chatterbot|Fred]]
*[[Jabberwacky]]
*[http://www.abenteuermedien.de/jabberwock Jabberwock]
*[http://www.jeeney.com/ Jeeney AI]
*[http://www.jixperts.com?lang=en JIxperts] – collection of wiki chatterbots.
*[http://www.iaindustrie.fr.nf KAR Intelligent Computer]
*[http://www.leeds-city-guide.com/kyle Kyle] – A unique learning Artificial Intelligence chatbot, which employs contextual learning algorithms.
*[[MegaHal]]
*[[Mr Know-It-All]]
*Oliverbot
*[http://uk.geocities.com/mattbrown1101/ Poseidon]
*[http://www.infradrive.com/robomatic.php RoboMatic X1] - A chatbot which controls the user's PC through chatting by their voice or by typing.
*[http://www.cooldictionary.com/splotchy.mpl Splotchy]
*[[Starship Titanic#Spookitalk|Spookitalk]] - A chatterbot used for [[Non-player character|NPC]]s in [[Douglas Adams]]' ''Starship Titanic'' video game.
*[http://www.onebigspace.com/ Thomas]
*[[Ultra Hal Assistant]]
*[[Verbot]]
*[http://www.yhaken.com/ Yhaken]
*[http://www.scientiobot.com ScientioBot] - A new technology chatterbot using concept mining techniques accessible via a free web service.
*[http://nicole.jetaylor.net NICOLE] A simple chatterbot with the ability to learn new phrases.
==[[Instant messenger|IM]] Chatterbots==
*DAI2 is also available on the MSN / Windows Live network as dai2@dai2.co.uk
*[http://www.dnreg.org/bot/ MSN Quickbot]
*[http://www.smarterchild.com SmarterChild]
*[http://www.spleak.com Spleak]
*[http://www.mrmovie.com MrMovie] - searching actors/movies/dvd's in IM (Skype, AOL/AIM or MSN/Live)
*[[Inside Messenger Bot|InsideMessenger]]
*[http://www.inocu.jt-online.co.uk Inocu] - (MSN/Live)
*[http://www.friendbot.co.uk FriendBot-An AIM Chatterbot]
*[http://www.amsn-project.net/plugins.php amsnEliza plugin for aMSN]
*[[Inside Messenger Bot|TrixieMouse]]
*[http://www.infobot.pl/ Infobot] - Polish informational bot for Gadu-gadu, Skype and Jabber
==AIML Chatterbots==
*[http://www.taik.fi/turingenigma Alan] - In ''Turing Enigma'' Alan Turing's spirit has infiltrated the World War II encrypting device Enigma.
*[http://www.dustyant.com/projects/deebot/ Deeb0t]
*[http://www.pandorabots.com/pandora/talk?botid=b0dafd24ee35a477 Chomsky] A chatbot that uses a smiley face to convey emotions. It uses the information in Wikipedia to build its conversations and has links to Wikipedia articles.
*[[John Lennon Artificial Intelligence Project]]
*[[SitePal]]
==JFred Chatterbots==
*[[The Turing Hub]]
==Educational Chatterbots==
*[http://www.philocomp.net/?pageref=ai&page=elizabeth Elizabeth] Aims to teach AI techniques and concepts, starting from chatterbot design. Accompanied by self-teaching materials, as used at the University of Leeds.
==Non-English Chatterbots==
*[http://www.geocities.com/brizglace/amanda.htm Amanda] - (French) with source code for Windows.
*[[Proteus]]
*[msnim:chat?contact=senhorbot@hotmail.com Senhor Bot] (Brazilian bot for MSN)
Loebner prize
The '''Loebner Prize''' is an annual competition that awards prizes to the [[Chatterbot]] considered by the judges to be the most [[Artificial intelligence|humanlike]] of those entered. The format of the competition is that of a standard [[Turing test]]. In the Loebner Prize, as in a Turing test, a human judge is faced with two computer screens. One is under the control of a computer, the other is under the control of a human. The judge poses questions to the two screens and receives answers. Based upon the answers, the judge must decide which screen is controlled by the human and which is controlled by the computer program.
The contest was begun in 1990 by [[Hugh Loebner]] in conjunction with the [[Cambridge Center for Behavioral Studies]] of [[Massachusetts]], [[United States]]. It has since been associated with [[Flinders University]], [[Dartmouth College]], the [[Science Museum (London)|Science Museum]] in [[London]], and most recently the [[University of Reading]].
Within the field of artificial intelligence, the Loebner Prize is somewhat controversial; the most prominent critic, [[Marvin Minsky]], has called it a publicity stunt that does not help the field along.
==Prizes==
The prizes for each year include:
* $2,000 for the most human-seeming of all chatterbots for that year - awarded every year. In 2005, the prize was increased to $3,000, and the prize was $2,250 in 2006. In 2008 the prize will be $3,000.
* $25,000 for the first chatterbot that judges cannot distinguish from a real human in a text-only Turing test, and that can convince judges that the other (human) entity they are talking to simultaneously is a computer. ''(to be awarded once only)''
* $100,000 to the first chatterbot that judges cannot distinguish from a real human in a Turing test that includes deciphering and understanding text, visual, and auditory input. ''(to be awarded once only)''
The Loebner Prize dissolves once the $100,000 prize is won.
==2008 Loebner Prize==
The 2008 Competition is to be held on Sunday [[12 October]] at the University of Reading, [[United Kingdom|UK]]. The event, which is being co-directed by [[Kevin Warwick]], will include a direct challenge on the [[Turing test]] as originally proposed by [[Alan Turing]]. The first-place winner will receive $3,000 and a bronze medal.
==2007 Loebner Prize==
The 2007 Competition was held on Sunday, [[21 October]] in [[New York City]]. The participants in the contest were:
* [[Rollo Carpenter]] from Icogno, creator of [[Jabberwacky]]
* Noah Duncan, private entry, creator of Cletus
* Robert Medeksza from Zabaware, creator of [[Ultra Hal Assistant]]
No bot passed the Turing test, but the judges ranked the bots by how human they appeared. The results of the contest were:
* 1st place: Robert Medeksza
* 2nd place: Noah Duncan
* 3rd place: Rollo Carpenter
The winner received $2,250 and the Annual Medal. The runners-up received $250 each.
==2006 Loebner Prize==
On Wednesday, [[August 30]], the finalists for the 2006 Loebner Prize were announced.
The finalists were:
* Rollo Carpenter
* Richard Churchill and Marie-Claire Jenkins
* Noah Duncan
* Robert Medeksza
The contest was held on Sunday, [[17 September]] at the Torrington Theatre, [[University College London]].
==Winners==
Machine learning
As a broad subfield of [[artificial intelligence]], '''machine learning''' is concerned with the design and development of [[algorithm]]s and techniques that allow computers to "learn". At a general level, there are two types of learning: [[Inductive reasoning|inductive]], and [[Deductive reasoning|deductive]]. Inductive machine learning methods extract rules and patterns out of massive data sets.
The major focus of machine learning research is to extract information from data automatically, by computational and statistical methods. Hence, machine learning is closely related not only to [[data mining]] and [[statistics]], but also [[theoretical computer science]].
==Applications==
Machine learning has a wide spectrum of applications including [[natural language processing]], [[syntactic pattern recognition]], [[search engines]], [[diagnosis|medical diagnosis]], [[bioinformatics]], [[brain-machine interfaces]] and [[cheminformatics]], detecting [[credit card fraud]], [[stock market]] analysis, classifying [[DNA sequence]]s, [[speech recognition|speech]] and [[handwriting recognition]], [[object recognition]] in [[computer vision]], [[strategy game|game playing]] and [[robot locomotion]].
== Human interaction ==
Some machine learning systems attempt to eliminate the need for human intuition in the analysis of the data, while others adopt a collaborative approach between human and machine. Human intuition cannot be entirely eliminated since the designer of the system must specify how the data is to be represented and what mechanisms will be used to search for a characterization of the data. Machine learning can be viewed as an attempt to automate parts of the [[scientific method]].
Some statistical machine learning researchers create methods within the framework of [[Bayesian statistics]].
== Algorithm types ==
Machine learning [[algorithm]]s are organized into a [[taxonomy]], based on the desired outcome of the algorithm. Common algorithm types include:
* [[Supervised learning]] — in which the algorithm generates a function that maps inputs to desired outputs. One standard formulation of the supervised learning task is the [[statistical classification|classification]] problem: the learner is required to learn (to approximate) the behavior of a function which maps a vector into one of several classes by looking at several input-output examples of the function.
* [[Unsupervised learning]] — in which an agent models a set of inputs; labeled examples are not available.
* [[Semi-supervised learning]] — which combines both labeled and unlabeled examples to generate an appropriate function or classifier.
* [[Reinforcement learning]] — in which the algorithm learns a policy of how to act given an observation of the world. Every action has some impact in the environment, and the environment provides feedback that guides the learning algorithm.
* [[Transduction (machine learning)|Transduction]] — similar to supervised learning, but does not explicitly construct a function: instead, tries to predict new outputs based on training inputs, training outputs, and test inputs which are available while training.
* [[Learning to learn]] — in which the algorithm learns its own [[inductive bias]] based on previous experience.
The computational analysis of machine learning algorithms and their performance is a branch of [[theoretical computer science]] known as [[computational learning theory]].
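The classification formulation of supervised learning described in the list above can be illustrated with a nearest-neighbor classifier, one of the simplest ways to approximate a function from input-output examples. This is a minimal sketch; the toy data and the function name are hypothetical, and practical learners are considerably more elaborate.

```python
import math


def nearest_neighbor_classify(training_examples, query):
    """Return the label of the training vector closest to `query`.

    `training_examples` is a list of (vector, label) pairs, i.e. the
    input-output examples of the unknown function being approximated.
    """
    def distance(a, b):
        # Euclidean distance between two equal-length vectors.
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    _, label = min(training_examples, key=lambda ex: distance(ex[0], query))
    return label


# Hypothetical toy data: two clusters of points in the plane.
EXAMPLES = [
    ((1.0, 1.0), "small"),
    ((1.5, 2.0), "small"),
    ((8.0, 9.0), "large"),
    ((9.0, 8.5), "large"),
]

if __name__ == "__main__":
    print(nearest_neighbor_classify(EXAMPLES, (2.0, 1.0)))
    print(nearest_neighbor_classify(EXAMPLES, (7.5, 8.0)))
```

The labels attached to the training vectors are what make this supervised; an unsupervised method would receive only the vectors and would have to find the two clusters itself.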
== Machine learning topics ==
:''This list represents the topics covered in a typical machine learning course.''
;Prerequisites
*[[Bayesian theory]]
;Modeling [[conditional probability|conditional probability density functions]]: [[Regression analysis|regression]] and [[Statistical classification|classification]]
*[[Artificial neural network]]s
*[[Decision tree]]s
*[[Gene expression programming]]
*[[Genetic algorithms]]
*[[Genetic programming]]
*[[Holographic associative memory]]
*[[Inductive Logic Programming]]
*[[Kriging|Gaussian process regression]]
*[[Linear discriminant analysis]]
*[[Nearest neighbor (pattern recognition)|K-nearest neighbor]]
*[[Minimum message length]]
*[[Perceptron]]
*[[Quadratic classifier]]
*[[Radial basis function network]]s
*[[Support vector machine]]s
;Algorithms for estimating model parameters:
*[[Dynamic programming]]
*[[Expectation-maximization algorithm]]
;Modeling [[probability density function]]s through [[generative model]]s:
*[[Graphical model]]s including [[Bayesian network]]s and [[Markov network|Markov random fields]]
*[[Generative topographic map]]
;Approximate inference techniques
*[[Monte Carlo method]]s
*[[Variational Bayes]]
*[[Variable-order Markov model]]s
*[[Variable-order Bayesian network]]s
*[[Loopy belief propagation]]
;Optimization
*Most of the methods listed above either use [[Optimization (mathematics)|optimization]] or are instances of optimization algorithms
;Meta-learning (ensemble methods)
*[[Boosting]]
*[[Bootstrap aggregating]]
*[[Random forest]]
*[[Weighted majority algorithm]]
;Inductive transfer and learning to learn
*[[Inductive transfer]]
*[[Reinforcement learning]]
*[[Temporal difference learning]]
*[[Monte-Carlo method]]
Machine translation
'''Machine translation''', sometimes referred to by the abbreviation '''MT''', is a sub-field of [[computational linguistics]] that investigates the use of [[computer software]] to [[translation|translate]] text or speech from one [[natural language]] to another. At its basic level, MT performs simple [[substitution]] of words in one natural language for words in another. Using [[corpus linguistics|corpus]] techniques, more complex translations may be attempted, allowing for better handling of differences in [[linguistic typology]], phrase [[recognition]], and translation of [[idiom]]s, as well as the isolation of anomalies.
Current machine translation software often allows for customisation by domain or [[profession]] (such as [[meteorology|weather reports]]) — improving output by limiting the scope of allowable substitutions. This technique is particularly effective in domains where formal or formulaic language is used. It follows then that machine translation of government and legal documents more readily produces usable output than conversation or less standardised text.
Improved output quality can also be achieved by human intervention: for example, some systems are able to translate more accurately if the user has [[word sense disambiguation|unambiguously identified]] which words in the text are names. With the assistance of these techniques, MT has proven useful as a tool to assist human translators, and in some cases can even produce output that can be used "as is". However, current systems are unable to produce output of the same quality as a human translator, particularly where the text to be translated uses casual language.
==History==
The history of machine translation begins in the 1950s, after [[World War II]]. The [[Georgetown-IBM experiment|Georgetown experiment]] (1954) involved fully-automatic translation of over sixty [[Russian language|Russian]] sentences into [[English language|English]]. The experiment was a great success and ushered in an era of substantial funding for machine-translation research. The authors claimed that within three to five years, machine translation would be a solved problem.
Real progress was much slower, however, and after the [[ALPAC|ALPAC report]] (1966), which found that the ten-year-long research had failed to fulfill expectations, funding was greatly reduced. Beginning in the late 1980s, as [[computation]]al power increased and became less expensive, more interest was shown in [[statistical machine translation|statistical models for machine translation]].
The idea of using digital computers for translation of natural languages was proposed as early as 1946 by A. D. Booth and possibly others. The Georgetown experiment was by no means the first such application, and a demonstration was made in 1954 on the APEXC machine at Birkbeck College (London Univ.) of a rudimentary translation of English into French. Several papers on the topic were published at the time, and even articles in popular journals (see for example ''Wireless World'', Sept. 1955, Cleave and Zacharov). A similar application, also pioneered at Birkbeck College at the time, was reading and composing Braille texts by computer.
Recently, the Internet has emerged as a global information infrastructure, revolutionizing access to information as well as its rapid transfer and exchange. Using Internet and e-mail technology, people need to communicate rapidly over long distances and across continents. Not all Internet users, however, can use their own language to communicate with speakers of other languages. With machine translation software, people may in the near future be able to communicate with one another around the world in their own mother tongue.
==Translation process==
The [[translation process]] may be stated as:
# [[Decoding]] the [[meaning (linguistic)|meaning]] of the [[source text]]; and
# Re-[[encoding]] this [[meaning (linguistic)|meaning]] in the [[target language]].
Behind this ostensibly simple procedure lies a complex [[cognitive]] operation. To decode the meaning of the [[source text]] in its entirety, the translator must interpret and analyse all the features of the text, a process that requires in-depth knowledge of the [[grammar]], [[semantics]], [[syntax]], [[idiom]]s, etc., of the [[source language]], as well as the [[culture]] of its speakers. The translator needs the same in-depth knowledge to re-encode the meaning in the [[target language]].
Therein lies the challenge in machine translation: how to program a computer that will "understand" a text as a person does, and that will "create" a new text in the [[target language]] that "sounds" as if it has been written by a person.
This problem may be approached in a number of ways.
==Approaches==
Machine translation can use a method based on [[Expert System|linguistic rules]], in which words are translated according to those rules: the most suitable (orally speaking) words of the target language replace the ones in the source language.
It is often argued that the success of machine translation requires the problem of [[natural language processing|natural language understanding]] to be solved first.
Generally, rule-based methods parse a text, usually creating an intermediary, symbolic representation, from which the text in the target language is generated. According to the nature of the intermediary representation, an approach is described as [[interlingual machine translation]] or [[transfer-based machine translation]]. These methods require extensive [[lexicon]]s with [[morphology (linguistics)|morphological]], [[syntax|syntactic]], and [[semantics|semantic]] information, and large sets of rules.
Given enough data, machine translation programs often work well enough for a [[native speaker]] of one language to get the approximate meaning of what is written by the other native speaker. The difficulty is getting enough data of the right kind to support the particular method. For example, the large multilingual [[Text corpus|corpus]] of data needed for statistical methods to work is not necessary for the grammar-based methods. But then, the grammar methods need a skilled linguist to carefully design the grammar that they use.
To translate between closely related languages, a technique referred to as [[shallow-transfer machine translation]] may be used.
===Rule-based===
The rule-based machine translation paradigm includes transfer-based machine translation, interlingual machine translation and dictionary-based machine translation paradigms.
'''''Transfer-based machine translation'''''
'''''Interlingual'''''
Interlingual machine translation is one instance of rule-based machine-translation approaches. In this approach, the source language, i.e. the text to be translated, is transformed into an interlingual, i.e. source-/target-language-independent representation. The target language is then generated out of the [[interlinguistics|interlingua]].
'''''Dictionary-based'''''
Machine translation can use a method based on [[dictionary]] entries, which means that the words will be translated as they are by a dictionary.
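A minimal Python sketch of the dictionary-based idea follows. The four-entry English-to-French dictionary and the function name <code>translate_word_for_word</code> are hypothetical and chosen only for illustration; a real system would rely on a far larger lexicon.

```python
# Hypothetical English-to-French dictionary; a real system would use a large lexicon.
DICTIONARY = {
    "the": "le",
    "cat": "chat",
    "eats": "mange",
    "fish": "poisson",
}

def translate_word_for_word(sentence, dictionary=DICTIONARY):
    """Replace each word by its dictionary entry; unknown words pass through unchanged."""
    words = sentence.lower().split()
    return " ".join(dictionary.get(word, word) for word in words)
```

Because each word is looked up in isolation, the output ignores word order, agreement and idiom, which is why this method is usually limited to simple or formulaic text.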
===Statistical===
Statistical machine translation tries to generate translations using [[statistical methods]] based on bilingual text corpora, such as the [[Hansard#Canadian hansard and machine translation|Canadian Hansard]] corpus, the English-French record of the Canadian parliament, and [[EUROPARL]], the record of the [[European Parliament]]. Where such corpora are available, impressive results can be achieved translating texts of a similar kind, but such corpora are still very rare. The first statistical machine translation software was [[CANDIDE]] from [[IBM]]. Google used [[SYSTRAN]] for several years, but switched to a statistical translation method in October 2007. Recently, they improved their translation capabilities by inputting approximately 200 billion words from [[United Nations]] materials to train their system. Accuracy of the translation has improved.
===Example-based===
The example-based machine translation (EBMT) approach is often characterised by its use of a bilingual [[corpus]] as its main knowledge base at run-time. It is essentially translation by [[analogy]] and can be viewed as an implementation of the [[case-based reasoning]] approach of [[machine learning]].
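The following Python sketch illustrates translation by analogy in its simplest form: it retrieves the stored source sentence most similar to the input and returns the stored translation. The three-sentence corpus, the name <code>translate_by_example</code> and the use of character-level similarity from <code>difflib</code> are assumptions for this example; real EBMT systems also recombine fragments of several examples.

```python
import difflib

# Hypothetical bilingual corpus of aligned English-French sentence pairs.
EXAMPLES = {
    "good morning": "bonjour",
    "thank you very much": "merci beaucoup",
    "where is the station": "où est la gare",
}

def translate_by_example(sentence, examples=EXAMPLES):
    """Return the translation of the stored sentence most similar to `sentence`."""
    query = sentence.lower()
    best_source = max(
        examples,
        key=lambda source: difflib.SequenceMatcher(None, source, query).ratio(),
    )
    return examples[best_source]
```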
==Major issues==
===Disambiguation===
Word sense disambiguation concerns finding a suitable translation when a word can have more than one meaning. The problem was first raised in the 1950s by [[Yehoshua Bar-Hillel]]. He pointed out that without a "universal encyclopedia", a machine would never be able to distinguish between the two meanings of a word. Today there are numerous approaches designed to overcome this problem. They can be approximately divided into "shallow" approaches and "deep" approaches.
Shallow approaches assume no knowledge of the text. They simply apply statistical methods to the words surrounding the ambiguous word. Deep approaches presume a comprehensive knowledge of the word. So far, shallow approaches have been more successful.
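A shallow approach can be sketched in Python as a comparison between the words surrounding the ambiguous word and words associated with each sense. The sense labels, the hand-written context sets and the name <code>disambiguate</code> are hypothetical; a real system would typically estimate such associations statistically from a corpus.

```python
# Hypothetical context words associated with each sense of the ambiguous word "bank".
SENSE_CONTEXTS = {
    "financial institution": {"money", "loan", "account", "deposit"},
    "river side": {"river", "water", "fishing", "shore"},
}

def disambiguate(sentence, sense_contexts=SENSE_CONTEXTS):
    """Pick the sense whose context words overlap most with the sentence."""
    words = set(sentence.lower().split())
    return max(sense_contexts, key=lambda sense: len(words & sense_contexts[sense]))
```

No model of the world is involved: the choice depends only on which neighbouring words happen to appear, which is what makes the approach shallow.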
===Named entities===
Related to [[named entity recognition]] in [[information extraction]].
==Applications==
There are now many [[software]] programs for translating natural language, several of them [[online]], such as the [[SYSTRAN]] system which powers both [[Google]] translate and [[AltaVista]]'s [[Babel Fish (website)|Babel Fish]] as well as [[Promt]] that powers online translation services at Voila.fr and Orange.fr. Although no system provides the holy grail of "fully automatic high quality machine translation" (FAHQMT), many systems produce reasonable output.
Despite their inherent limitations, MT programs are used around the world. Probably the largest institutional user is the [[European Commission]]. [[Toggletext]] uses a transfer-based system (known as Kataku) to translate between [[English language|English]] and [[Indonesian language|Indonesian]]. [[Google]] has claimed that promising results were obtained using a proprietary statistical machine translation engine. The statistical translation engine used in the [[Google tools#anchor_language_tools|Google language tools]] for Arabic <-> English and Chinese <-> English has an overall score of 0.4281, ahead of the runner-up IBM's BLEU-4 score of 0.3954 (Summer 2006), in tests conducted by the National Institute of Standards and Technology. [[Uwe Muegge]] has implemented a demo website that uses a [[controlled language]] in combination with the [[Google tools#anchor_language_tools|Google tool]] to produce fully automatic, high-quality machine translations of his English, German, and French web sites.
With the recent focus on terrorism, military sources in the United States have been investing significant amounts of money in natural language engineering. ''In-Q-Tel'' (a [[venture capital]] fund, largely funded by the US Intelligence Community, to stimulate new technologies through private sector entrepreneurs) brought up companies like [[Language Weaver]]. Currently the military community is interested in translation and processing of languages like [[Arabic language|Arabic]], [[Pashto language|Pashto]], and [[Dari language|Dari]]. The Information Processing Technology Office in [[DARPA]] hosts programs like [[DARPA TIDES program|TIDES]] and [[Babylon translator|Babylon Translator]]. The US Air Force has awarded a $1 million contract to develop a language translation technology.
== Evaluation ==
There are various means for evaluating the performance of machine-translation systems. The oldest is the use of human judges to assess a translation's quality. Even though human evaluation is time-consuming, it is still the most reliable way to compare different systems such as rule-based and statistical systems. [[Automate]]d means of evaluation include [[Bilingual evaluation understudy|BLEU]], [[NIST (metric)|NIST]] and [[METEOR]].
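Automated metrics of this kind compare a system's output with a human reference translation. The Python sketch below computes clipped unigram precision, which is one ingredient of BLEU-style scores and not the full BLEU metric (BLEU also combines longer n-grams and a brevity penalty). The function name and the whitespace tokenisation are assumptions for this example.

```python
from collections import Counter

def clipped_unigram_precision(candidate, reference):
    """Fraction of candidate words that also occur in the reference,
    counting each word at most as often as it appears in the reference."""
    candidate_counts = Counter(candidate.lower().split())
    reference_counts = Counter(reference.lower().split())
    total = sum(candidate_counts.values())
    if total == 0:
        return 0.0
    matched = sum(
        min(count, reference_counts[word])
        for word, count in candidate_counts.items()
    )
    return matched / total
```

The clipping step keeps a degenerate output such as "the the the" from scoring well merely because "the" occurs once in the reference.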
Relying exclusively on machine translation ignores that communication in [[natural language|human language]] is [[wiktionary:context|context]]-embedded, and that it takes a human to adequately comprehend the context of the original text. Even purely human-generated translations are prone to error. Therefore, to ensure that a machine-generated translation will be of publishable quality and useful to a human, it must be reviewed and edited by a human.
It has, however, been asserted that in certain applications, e.g. product descriptions written in a [[controlled language]], a [[dictionary-based machine translation|dictionary-based machine-translation]] system has produced satisfactory translations that require no human intervention.
Metadata
'''Metadata''' ('''meta data''', or sometimes '''metainformation''') is "data about data", of any sort in any media. An item of metadata may describe an individual [[datum]], or content item, or a collection of data including multiple content items and hierarchical levels, for example a [[database schema]].
== Purpose ==
Metadata provides context for data.
Metadata is used to facilitate the understanding, use, and management of data. The metadata required for effective data management varies with the type of data and context of use. In a [[library]], where the data is the content of the titles stocked, metadata about a title would typically include a description of the content, the [[author]], the publication date and the physical location.
== Examples of Metadata ==
=== Camera ===
In the context of a [[camera]], where the data is the photographic image, metadata would typically include the date the [[photograph]] was taken and details of the camera settings (lens, focal length, aperture, shutter timing, white balance, etc.).
=== Digital Music Player ===
On a digital portable music player, the album names, song titles and album art embedded in the music files are used to generate the artist and song listings, and are considered the metadata.
=== Information system ===
In the context of an [[information system]], where the data is the content of the [[computer]] files, metadata about an individual data item would typically include the name of the field and its length. Metadata about a collection of data items, a computer file, might typically include the name of the file, the type of file and the name of the data administrator.
=== Real world location ===
If we consider a particular place in the real world, this may be described by data, for example:
* 1 "E83BJ"
* 2 "17"
* 3 "Sunny"
To make sense of and use this data, context is important, and can be provided by metadata. The metadata for the above three items of data might include:
* 1.1 "Post Code" – This is a brief description (or name) of the data item "E83BJ"
* 1.2 "The unique identifier of a postal district" – This is another description (a definition) of "E83BJ"
* 1.3 "27 June 2006" – This could also help describe "E83BJ", for example by giving the date it was last updated
* 2 "Average temperature in degrees Celsius" – This is a possible description of "17"
* 3 "Yesterday's weather" – This is a description of "sunny"
An item of metadata is itself data and therefore may have its own metadata. For example, "Post Code" might have the following metadata:
* 1.1.1 "data item name"
* 1.1.2 "5 characters, starting with A – Z"
"27 June 2006" might have the following metadata:
* 1.3.1 "date last changed"
* 1.3.2 "dd MMM yyyy"
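The nesting described above can be sketched as a recursive data structure in Python. The keys <code>value</code> and <code>metadata</code> and the helper <code>depth</code> are hypothetical names chosen for this illustration; the values are taken from the post code example.

```python
# Each entry pairs a value with its metadata, which is itself a list of such entries.
postcode = {
    "value": "E83BJ",
    "metadata": [
        {
            "value": "Post Code",
            "metadata": [
                {"value": "data item name", "metadata": []},
                {"value": "5 characters, starting with A – Z", "metadata": []},
            ],
        },
        {"value": "The unique identifier of a postal district", "metadata": []},
        {
            "value": "27 June 2006",
            "metadata": [
                {"value": "date last changed", "metadata": []},
                {"value": "dd MMM yyyy", "metadata": []},
            ],
        },
    ],
}

def depth(item):
    """Number of metadata levels below an item (0 if it has no metadata)."""
    if not item["metadata"]:
        return 0
    return 1 + max(depth(child) for child in item["metadata"])
```

Because every metadata entry has the same shape as the datum it describes, further levels can be attached without changing the structure.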
== Levels ==
The hierarchy of metadata descriptions can go on forever, but usually context or semantic understanding makes extensively detailed explanations unnecessary.
The role played by any particular [[datum]] depends on the context. For example, when considering the geography of London, "E83BJ" would be a datum and "Post Code" would be metadatum. But, when considering the data management of an automated system that manages geographical data, "Post Code" might be a datum and then "data item name" and "5 characters, starting with A – Z" would be metadata.
In any particular context, metadata characterizes the data it describes, not the entity described by that data. So, in relation to "E83BJ", the datum "is in London" is a further description of the place in the real world which has the post code "E83BJ", not of the code itself. Therefore, although it is providing information connected to "E83BJ" (telling us that this is the post code of a place in London), this would not normally be considered metadata, as it is describing "E83BJ" ''qua'' place in the real world and not ''qua'' data.
== Definitions ==
=== Etymology ===
[[Meta]] is a classical Greek preposition (μετ’ αλλων εταιρων) and prefix (μεταβασις) conveying the following senses in English, depending upon the case of the associated noun: among; along with; with; by means of; in the midst of; after; behind. In [[epistemology]], the word means "about (its own category)"; thus metadata is "data about the data".
=== Varying definitions ===
The term was introduced intuitively, without a formal definition. Because of that, today there are various definitions. The most common one is the literal translation:
* "Data about data are referred to as metadata."
Example: "12345" is data, and with no additional context is meaningless. When "12345" is given a meaningful name (metadata) of "[[ZIP code]]", one can understand (at least in the [[United States]], and further placing "ZIP code" within the context of a [[postal address]]) that "12345" refers to the [[General Electric]] plant in [[Schenectady, New York]].
As for most people the difference between data and [[information]] is merely a [[philosophical]] one of no relevance in practical use, other definitions are:
* Metadata is information about data.
* Metadata is information about information.
* Metadata contains information about that data or other data
There are more sophisticated definitions, such as:
*"Metadata is structured, encoded data that describe characteristics of information-bearing entities to aid in the identification, discovery, assessment, and management of the described entities."
* "[Metadata is a set of] optional structured descriptions that are publicly available to explicitly assist in locating objects."
These are used more rarely because they tend to concentrate on one purpose of metadata — to find "objects", "entities" or "resources" — and ignore others, such as using metadata to optimize [[data compression|compression algorithms]], or to perform additional computations using the data.
The metadata concept has been extended into the world of systems to include any "data about data": the names of tables, columns, programs, and the like. Different views of this "system metadata" are detailed below, but beyond that is the recognition that metadata can describe all aspects of systems: data, activities, people and organizations involved, locations of data and processes, access methods, limitations, timing and events, as well as motivation and rules.
Fundamentally, then, metadata is "the data that describe the structure and workings of an organization's use of information, and which describe the systems it uses to manage that information". To do a model of metadata is to do an "[[Enterprise modeling|Enterprise model]]" of the information technology industry itself.
=== Metadata and Markup ===
In the context of the web and the work of the [[W3C]] in providing markup technologies of [[HTML]], [[XML]] and [[SGML]] the concept of metadata has specific context that is perhaps clearer than in other information domains. With markup technologies there is metadata, markup and data content. The metadata describes characteristics about the data, while the markup identifies the specific type of data content and acts as a container for that document instance. This page in Wikipedia is itself an example of such usage, where the textual information is data, how it is packaged, linked, referenced, styled and displayed is markup and aspects and characteristics of that markup are metadata set globally across Wikipedia.
In the context of markup the metadata is architected to allow optimization of document instances to contain only a minimum amount of metadata, while the metadata itself is likely referenced externally such as in a [[schema]] definition ([[XSD]]) instance. Also it should be noted that markup provides specialised mechanisms that handle referential data, again avoiding confusion over what is metadata or data, and allowing optimizations. The reference and ID mechanisms in markup allow reference links between related data items, and links to data items that can then be repeated about a data item, such as an address or product details. These are then all themselves simply more data items and markup instances rather than metadata.
Similarly there are concepts such as classifications, ontologies and associations for which markup mechanisms are provided. A data item can then be linked to such categories via markup, thereby providing a clean delineation between what is metadata and what are actual data instances. Therefore the concepts and descriptions in a classification would be metadata, but the actual classification entry for a data item is simply another data instance.
Some examples can illustrate the points here. Items in bold are data content, in italic are metadata, normal text items are all markup.
The two examples show in-line use of metadata within markup relating to a data instance (XML) compared to simple markup (HTML).
A simple [[HTML]] instance example:
<span style="normalText">'''Example'''</span>
And then an [[XML]] instance example with metadata:
'''John'''
Where the inline assertion that a person's middle name may be an empty data item is metadata about the data item. Such definitions however are usually not placed inline in XML. Instead these definitions are moved away into the [[schema]] definition that contains the metadata for the entire document instance. This again illustrates another important aspect of metadata in the context of markup. The metadata is optimally defined only once for a collection of data instances. Hence repeated items of markup are rarely metadata, but rather more markup data instances themselves.
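To make the three roles concrete, the Python sketch below parses a small XML instance and separates markup, inline metadata and data content. The element name <code>personMiddleName</code> and the attribute <code>nillable</code> are hypothetical stand-ins chosen to match the description above (an inline assertion that a middle name may be empty); they are not taken from any particular schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance: the element and attribute names are illustrative only.
XML_INSTANCE = '<personMiddleName nillable="true">John</personMiddleName>'

element = ET.fromstring(XML_INSTANCE)

markup = element.tag                    # the container identifying the kind of data
inline_metadata = dict(element.attrib)  # assertions about the data item
data_content = element.text             # the data itself
```

In practice such an assertion would usually live once in the schema definition and not be repeated on every instance, as the paragraph above notes.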
=== Hierarchies of metadata ===
When structured into a hierarchical arrangement, metadata is more properly called an [[Ontology (computer science)|ontology]] or [[schema]]. Both terms describe "what exists" for some purpose or to enable some action. For instance, the arrangement of subject headings in a library catalog serves not only as a guide to finding books on a particular subject in the stacks, but also as a guide to what subjects "exist" in the library's own ontology and how more specialized topics are related to or derived from the more general subject headings.
Metadata is frequently stored in a central location and used to help organizations standardize their data. This information is typically stored in a [[metadata registry]].
=== Difference between data and metadata ===
Usually it is not possible to distinguish between (plain) data and metadata because:
*Something can be data and metadata at the same time. The headline of an article is both its title (metadata) and part of its text (data).
* Data and metadata can change their roles. A poem, as such, would be regarded as data, but if there were a song that used it as lyrics, the whole poem could be attached to an audio file of the song as metadata. Thus, the labeling depends on the point of view.
These considerations apply no matter which of the above definitions is considered, except where explicit markup is used to denote what is data and what is metadata.
== Use ==
Metadata has many different applications; this section lists some of the most common.
Metadata is used to speed up and enrich searching for resources. In general, search queries using metadata can save users from performing more complex filter operations manually. It is now common for web browsers (with the notable exception of Mozilla Firefox), P2P applications and media management software to automatically download and locally cache metadata, to improve the speed at which files can be accessed and searched.
Metadata may also be associated with files manually. This is often the case with documents which are scanned into a document storage repository such as FileNet or Documentum. Once the documents have been converted into an electronic format, a user brings the image up in a viewer application, manually reads the document and keys values into an online application to be stored in a metadata repository.
Metadata provides additional information to users of the data it describes. This information may be descriptive ("These pictures were taken by children in the school's third grade class.") or algorithmic ("Checksum=139F").
Metadata helps to bridge the [[semantic gap]]. By telling a computer how data items are related and how these relations can be evaluated automatically, it becomes possible to process even more complex filter and search operations. For example, if a search engine understands that "Van Gogh" was a "Dutch painter", it can answer a search query on "Dutch painters" with a link to a web page about Vincent Van Gogh, although the exact words "Dutch painters" never occur on that page. This approach, called knowledge representation, is of special interest to the [[semantic web]] and [[artificial intelligence]].
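A minimal Python sketch of the idea: pages are described by simple (subject, relation, object) statements, and a query is answered from those statements without relying on the literal words on the page. The statements, the relation names and the function <code>find_subjects</code> are assumptions made for this illustration only.

```python
# Hypothetical metadata: (subject, relation, object) statements about pages.
STATEMENTS = [
    ("Vincent van Gogh", "nationality", "Dutch"),
    ("Vincent van Gogh", "occupation", "painter"),
    ("Claude Monet", "nationality", "French"),
    ("Claude Monet", "occupation", "painter"),
]

def find_subjects(required, statements=STATEMENTS):
    """Return subjects for which every (relation, object) pair in `required` is stated."""
    facts = {}
    for subject, relation, obj in statements:
        facts.setdefault(subject, set()).add((relation, obj))
    return sorted(
        subject for subject, known in facts.items() if set(required) <= known
    )
```

A query for Dutch painters is expressed as <code>find_subjects([("nationality", "Dutch"), ("occupation", "painter")])</code>, which matches on the stated relations even though the phrase "Dutch painters" appears nowhere in the data.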
Certain metadata is designed to optimize [[lossy compression]]. For example, if a video has metadata that allows a computer to tell foreground from background, the latter can be compressed more aggressively to achieve a higher compression rate.
Some metadata is intended to enable variable content presentation. For example, if a picture has metadata that indicates the most important region (the one where there is a person), an image viewer on a small screen, such as that of a mobile phone, can narrow the picture to that region and thus show the user the most interesting details. A similar kind of metadata is intended to allow blind people to access diagrams and pictures, by converting them for special output devices or reading their description using [[speech synthesis|text-to-speech]] software.
Other descriptive metadata can be used to automate workflows. For example, if a "smart" software tool knows content and structure of data, it can convert it automatically and pass it to another "smart" tool as input. As a result, users save the many [[cut, copy and paste|copy-and-paste]] operations required when analyzing data with "dumb" tools.
Metadata is becoming an increasingly important part of [[electronic discovery]]. [http://www.lexbe.com/hp/indepth-e-discovery-rule-metadata.htm] Application and file system metadata derived from [[electronic document]]s and files can be important evidence. Recent changes to the [[Federal Rules of Civil Procedure]] make metadata routinely discoverable as part of [[Civil law (common law)|civil litigation]]. Parties to litigation are required to maintain and produce metadata as part of [[discovery (law)|discovery]], and [[spoliation of evidence|spoliation]] of metadata can lead to sanctions.
Metadata has become important on the [[World Wide Web]] because of the need to find useful information from the mass of information available. Manually-created metadata adds value because it ensures consistency. If a web page about a certain topic contains a word or phrase, then all web pages about that topic should contain that same word or phrase. Metadata also ensures variety, so that if a topic goes by two names each will be used. For example, an article about "[[sport utility vehicle]]s" would also be [[tag (metadata)|tagged]] "4 wheel drives", "4WDs" and "four wheel drives", as this is how SUVs are known in some countries.
Examples of metadata for an [[Compact Disc|audio CD]] include the [[MusicBrainz]] project and [[All Media Guide]]'s [[Allmusic]]. Similarly, [[MP3]] files have metadata tags in a format called [[ID3]].
== Types of metadata ==
Metadata can be classified by:
* Content. Metadata can either describe the ''resource'' itself (for example, name and size of a file) or the ''content'' of the resource (for example, "This video shows a boy playing football").
* Mutability. With respect to the whole resource, metadata can be either ''immutable'' (for example, the "Title" of a video does not change as the video itself is being played) or ''mutable'' (the "Scene description" does change).
* Logical function. There are three layers of logical function: at the bottom the ''subsymbolic'' layer that contains the raw data itself, then the ''symbolic'' layer with metadata describing the raw data, and on the top the ''logical'' layer containing metadata that allows logical reasoning using the symbolic layer.
== Important issues ==
To successfully develop and use metadata, several important issues should be treated with care:
=== Metadata risks ===
[[Microsoft Office]] files include metadata beyond their printable content, such as the original author's name, the creation date of the document, and the amount of time spent editing it. Unintentional disclosure can be awkward or even, in professional practices requiring confidentiality, raise malpractice concerns. Some of a Microsoft Office document's metadata can be seen by clicking ''File'' then ''Properties'' from the program's menu. Other metadata is not visible except through external analysis of a file, such as is done in forensics. The author of the Microsoft Word-based [[Melissa (computer worm)|Melissa]] computer virus in 1999 was caught due to Word metadata that uniquely identified the computer used to create the original infected document.
=== Metadata lifecycle ===
Even in the early phases of planning and designing it is necessary to keep track of all metadata created. It is not economical to start attaching metadata only after the production process has been completed. For example, if metadata created by a digital camera at recording time is not stored immediately, it may have to be restored afterwards manually with great effort. Therefore, it is necessary for different groups of resource producers to cooperate using compatible methods and standards.
* Manipulation. Metadata must adapt if the resource it describes changes. It should be merged when two resources are merged. These operations are seldom performed by today's software; for example, image editing programs usually do not keep track of the [[Exchangeable image file format|Exif]] metadata created by digital cameras.
* Destruction. It can be useful to keep metadata even after the resource it describes has been destroyed, for example in change histories within a text document or to archive file deletions due to digital rights management. None of today's metadata standards consider this phase.
=== Storage ===
Metadata can be stored either ''internally'', in the same file as the data, or ''externally'', in a separate file. Metadata that is embedded with content is called ''embedded metadata''. A data repository typically stores the metadata ''detached'' from the data. Both ways have advantages and disadvantages:
*Internal storage allows transferring metadata together with the data it describes; thus, metadata is always at hand and can be manipulated easily. This method creates high redundancy and does not allow holding metadata together.
* External storage allows bundling metadata, for example in a database, for more efficient searching. There is no redundancy and metadata can be transferred simultaneously when using [[streaming media|streaming]]. However, as most formats use [[Uniform Resource Identifier|URI]]s for that purpose, the method of how the metadata is linked to its data should be treated with care. What if a resource does not have a URI (resources on a local hard disk or web pages that are created on-the-fly using a content management system)? What if metadata can only be evaluated if there is a connection to the Web, especially when using [[Resource Description Framework|RDF]]? How to realize that a resource is replaced by another with the same name but different content?
Moreover, there is the question of data format: storing metadata in a human-readable format such as XML can be useful because users can understand and edit it without specialized tools. On the other hand, these formats are not optimized for storage capacity; it may be useful to store metadata in a binary, non-human-readable format instead to speed up transfer and save memory.
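The trade-off between readable and compact encodings can be illustrated with a minimal Python sketch. The record and its field names (<code>width</code>, <code>height</code>, <code>timestamp</code>) are hypothetical and do not come from any particular metadata standard; the binary layout is likewise an assumption chosen for the example.

```python
import struct
import xml.etree.ElementTree as ET

# Hypothetical metadata record; field names are illustrative only.
record = {"width": 1920, "height": 1080, "timestamp": 1200000000}

# Human-readable encoding: one XML element per field.
root = ET.Element("metadata")
for name, value in record.items():
    ET.SubElement(root, name).text = str(value)
xml_bytes = ET.tostring(root, encoding="utf-8")

# Binary encoding (assumed layout): two unsigned 16-bit integers and one
# unsigned 32-bit integer, little-endian, with no field names stored.
binary_bytes = struct.pack(
    "<HHI", record["width"], record["height"], record["timestamp"]
)

print(len(xml_bytes), "bytes as XML")
print(len(binary_bytes), "bytes as packed binary")

# Reading the binary form back requires knowing the layout in advance,
# whereas the XML form names each field explicitly.
width, height, timestamp = struct.unpack("<HHI", binary_bytes)
```

The packed form is considerably smaller, but it cannot be read or edited without a tool that knows the layout.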
== Criticisms ==
Although the majority of computer scientists see metadata as a chance for better interoperability, some critics argue:
*Metadata is too expensive and time-consuming. The argument is that companies will not produce metadata without need because it costs extra money, and private users also will not produce complex metadata because its creation is very time-consuming.
* Metadata is too complicated. Private users will not create metadata because existing formats, especially [[MPEG-7]], are too complicated. As long as there are no automatic tools for creating metadata, it will not be created.
* Metadata is subjective and depends on context. Most probably, two people will attach different metadata to the same resource due to their different points of view. Moreover, metadata can be misinterpreted due to its dependency on context. For example, searching for "post-modern art" may miss a certain item because the expression was not in use at the time when that work of art was created, or searching for "pictures taken at 1:00" may produce confusing results due to local time differences.
* There is no end to metadata. For example, when annotating a soccer match with metadata, one can describe all the players and their actions in time and stop there. One can also describe the advertisements in the background and the clothes the players wear. One can also describe each fan in the stands and the clothes they wear. All of this metadata can be interesting to one party or another (such as the spectators, sponsors or a counter-terrorist unit of the police), and even for a simple resource the amount of possible metadata can be gigantic.
* Metadata is useless. Many of today's search engines are very efficient at finding text. Other techniques for finding pictures, videos and music (namely query-by-example) will become more and more powerful in the future. Thus, there is no real need for metadata.
Opponents of metadata sometimes use the term [[metacrap]] to refer to the unsolved problems of metadata in some scenarios.
These people are also referred to as "Meta Haters."
== Types ==
In general, there are two distinct classes of metadata: structural or control metadata and guide metadata. Structural metadata is used to describe the structure of computer systems such as tables, columns and indexes. Guide metadata is used to help humans find specific items and is usually expressed as a set of keywords in a natural language.
Metadata can be divided into three distinct categories:
* Descriptive
* Administrative
* Structural
=== Relational database metadata ===
Each [[relational database]] system has its own mechanisms for storing metadata. Examples of relational-database metadata include:
* Tables of all tables in a database, with their names, sizes and the number of rows in each table.
* Tables of columns in each database, what tables they are used in, and the type of data stored in each column.
In database terminology, this set of metadata is referred to as the [[database catalog|catalog]]. The [[SQL]] standard specifies a uniform means to access the catalog, called the INFORMATION_SCHEMA, but not all databases implement it, even if they implement other aspects of the SQL standard. For an example of database-specific metadata access methods, see [[Oracle metadata]].
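As an illustration of a database-specific catalog, the following minimal Python sketch uses [[SQLite]], which does not provide INFORMATION_SCHEMA and instead exposes its catalog through the <code>sqlite_master</code> table and the <code>PRAGMA table_info</code> statement. The <code>customer</code> and <code>invoice</code> tables are hypothetical and exist only for the example.

```python
import sqlite3

# In-memory database with two hypothetical tables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE invoice (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)

# "Table of all tables": SQLite records every table in sqlite_master.
tables = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]

# "Table of columns": PRAGMA table_info yields one row per column in the form
# (cid, name, type, notnull, dflt_value, pk).
columns = {
    table: [(row[1], row[2]) for row in conn.execute(f"PRAGMA table_info({table})")]
    for table in tables
}

for table in tables:
    print(table, columns[table])

conn.close()
```

On a system that implements INFORMATION_SCHEMA, the same information would normally be obtained by querying its standard views instead.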
=== Data warehouse metadata ===
[[Data warehouse]] metadata systems are sometimes separated into two sections:
# '''back room''' metadata that are used for [[Extract, transform, load]] functions to get [[OLTP]] data into a data warehouse
# '''front room''' metadata that are used to label screens and create reports
Kimball lists the following types of metadata in a data warehouse (See also [http://www.fortunecity.com/skyscraper/oracle/699/orahtml/dbmsmag/9803d05.html]):
* [[source system]] metadata
** source specifications, such as [[repository|repositories]], and source [[logical schema]]s
** source descriptive information, such as ownership descriptions, update frequencies, legal limitations, and [[access method]]s
** process information, such as job schedules and extraction code
* [[data staging]] metadata
** [[data acquisition]] information, such as [[data transmission]] scheduling and results, and file usage
** [[dimension table]] management, such as definitions of dimensions, and [[surrogate key]] assignments
** [[Program transformation|transformation]] and [[aggregation]], such as [[data enhancement]] and mapping, [[DBMS]] load scripts, and aggregate definitions
** audit, job logs and documentation, such as [[data lineage]] records, [[data transform]] logs
* DBMS metadata, such as:
** DBMS system table contents
** processing hints
Michael Brackett defines metadata (what he calls "Data resource data") as "any data about the organization's data resource". Adrienne Tannenbaum defines metadata as "the detailed description of instance data. The format and characteristics of populated instance data: instances and values, dependent on the role of the metadata recipient". These definitions are characteristic of the "data about data" definition.
=== Business Intelligence metadata ===
[[Business Intelligence]] is the process of analyzing large amounts of corporate data, usually stored in large databases such as the [[Data Warehouse]], tracking business performance, detecting patterns and trends, and helping enterprise business users make better decisions.
Business Intelligence metadata describes how data is queried, filtered, analyzed, and displayed in Business Intelligence software tools, such as Reporting tools, OLAP tools, and Data Mining tools.
Examples:
* [[Online analytical processing|OLAP]] metadata: The descriptions and structures of Dimensions, Cubes, Measures (Metrics), Hierarchies, Levels, Drill Paths
* Reporting metadata: The descriptions and structures of Reports, Charts, Queries, DataSets, Filters, Variables, Expressions
* [[Data Mining]] metadata: The descriptions and structures of DataSets, Algorithms, Queries
Business Intelligence metadata can be used to understand how corporate financial reports reported to [[Wall Street]] are calculated, how the revenue, expense and profit are aggregated from individual sales transactions stored in the data warehouse.
A good understanding of Business Intelligence metadata is required to solve complex problems such as compliance with corporate governance standards, for example [[Sarbanes Oxley]] (SOX) or Basel II.
=== General IT metadata ===
In contrast, David Marco, another metadata theorist, defines metadata as "all physical data and knowledge from inside and outside an organization, including information about the physical data, technical and business processes, rules and constraints of the data, and structures of the data used by a corporation." Others have included web services, systems and interfaces. In fact, the entire [[Zachman framework]] (see [[Enterprise Architecture]]) can be represented as metadata.
Notice that such definitions expand metadata's scope considerably, to encompass most or all of the data required by the [[Management Information System]]s capability. In this sense, the concept of metadata has significant overlaps with the [[ITIL]] concept of a Configuration Management Database ([[CMDB]]), and also with disciplines such as [[Enterprise Architecture]] and [[IT portfolio management]].
This broader definition of metadata has precedent. Third generation corporate repository products (such as those eventually merged into the CA Advantage line) not only store information about data definitions (COBOL copybooks, DBMS schema), but also about the programs accessing those data structures, and the [[Job Control Language]] and batch job infrastructure dependencies as well. These products (some of which are still in production) can provide a very complete picture of a mainframe computing environment, supporting exactly the kinds of impact analysis required for ITIL-based processes such as [[ITIL#Incident Management|Incident]] and [[Change Management (ITIL)|Change Management]]. The [[ITIL]] [http://www.tso.co.uk/itil/ Back Catalogue] includes the ''Data Management'' volume which recognizes the role of these metadata products on the mainframe, posing the [[CMDB]] as the distributed computing equivalent. CMDB vendors however have generally not expanded their scope to include data definitions, and metadata solutions are also available in the distributed world. Determining the appropriate role and scope for each is thus a challenge for large IT organizations requiring the services of both.
Since metadata is pervasive, centralized attempts at tracking it need to focus on the most highly leveraged assets. Enterprise Assets may only constitute a small percentage of the entire IT portfolio.
Some practitioners have successfully managed IT metadata using the [[Dublin Core]] metamodel.
==== IT metadata management products ====
First generation data dictionary/metadata repository tools would be those only supporting a specific [[DBMS]], such as [[IDMS]]'s IDD (integrated data dictionary), the [[Information Management System|IMS]] Data Dictionary, and [[ADABAS]]'s Predict.
Second generation would be ASG's DATAMANAGER product which could support many different file and DBMS types.
Third generation repository products became briefly popular in the early 1990s along with the rise of widespread use of [[RDBMS]] engines such as IBM's [[IBM DB2|DB2]].
Fourth generation products link the repository with more [[Extract, transform, load]] tools and can be connected with architectural modeling tools. Examples include [http://www.adaptive.com/products/mm.html Adaptive Metadata Manager] from Adaptive, [http://www.asg.com/products/product_details.asp?code=ROC&id=50 Rochade] from ASG, [http://www.infolibcorp.com/productsOverview.html InfoLibrarian Metadata Integration Framework], and the Metis Server product from [[Troux Technologies]].
=== File system metadata ===
Nearly all [[file system]]s keep metadata about files [[out-of-band]]. Some systems keep metadata in [[directory (file systems)|directory]] entries; others keep it in specialized structures such as [[inode]]s, or even in the name of a file. Metadata can range from simple [[timestamp]]s, [[mode bit]]s, and other special-purpose information used by the implementation itself, to [[icon (computing)|icon]]s and free-text comments, to arbitrary [[attribute-value pair]]s.
With more complex and open-ended metadata, it becomes useful to search for files based on the metadata contents. The [[Unix]] [[find]] utility was an early example, although it is inefficient when scanning hundreds of thousands of files on a modern computer system. [[Apple Computer]]'s [[Mac OS X]] operating system supports cataloguing and searching for file metadata through a feature known as [[Spotlight (software)|Spotlight]], as of [[Mac OS X v10.4|version 10.4]]. [[Microsoft]] developed similar functionality with the [[Instant Search]] system in [[Windows Vista]]; it is also present in [[SharePoint Server]]. [[Linux]] implements file metadata using [[extended file attributes]].
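The simplest kinds of file system metadata mentioned above (size, mode bits, timestamps) can be read with a short Python sketch such as the following. It uses the standard <code>os.stat</code> call on a temporary file created only for the example; which fields are meaningful depends on the operating system and file system.

```python
import os
import stat
import tempfile
import time

# Create a small temporary file so the example is self-contained.
with tempfile.NamedTemporaryFile(delete=False) as handle:
    handle.write(b"hello")
    path = handle.name

# os.stat returns metadata kept by the file system, not the file's content.
info = os.stat(path)
size = info.st_size                      # length in bytes
mode = stat.filemode(info.st_mode)       # mode bits as text, e.g. "-rw-------"
is_regular = stat.S_ISREG(info.st_mode)  # file type encoded in the mode bits
modified = time.ctime(info.st_mtime)     # last-modification timestamp

print(path, size, mode, modified)

os.remove(path)
```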
=== Image metadata ===
Examples of image files containing metadata include [[Exchangeable image file format]] (EXIF) and [[Tagged Image File Format]] (TIFF).
Having metadata about images embedded in TIFF or EXIF files is one way of acquiring additional data about an image. [[Tag (metadata)|Tagging]] pictures with subjects, related emotions, and other descriptive phrases helps Internet users find pictures easily rather than having to search through entire image collections. A prime example of an image tagging service is [[Flickr]], where users upload images and then describe the contents. Other patrons of the site can then search for those tags. Flickr uses a [[folksonomy]]: a free-text keyword system in which the community defines the vocabulary through use rather than through a [[controlled vocabulary]].
Users can also tag photos for organization purposes using Adobe's [[Extensible Metadata Platform]] (XMP) language, for example.
Digital photography is increasingly making use of technical metadata tags describing the conditions of exposure. Photographers shooting [[RAW image format|Camera RAW]] file formats can use applications such as [[Adobe Bridge]] or Apple Computer's [[Aperture (photography software)|Aperture]] to work with camera metadata for post-processing.
=== Audio Metadata ===
Audio metadata generally relates to how the data should be written in order for a processor to process it efficiently. These technologies are usually seen in audio engine programming, such as the Microsoft [[Resource Interchange File Format|RIFF (Resource Interchange File Format)]] technologies used for .wav files.
Codecs generally define their own metadata standards for compression purposes.
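As a sketch of the kind of metadata a RIFF-based audio file carries, the following Python example writes one second of silence to an in-memory WAV file using the standard <code>wave</code> module and then decodes the 44-byte header that the module produces for uncompressed PCM audio. The parameter values (mono, 16-bit, 8000 Hz) are arbitrary choices for the example, and files written by other software may contain additional chunks that this fixed-offset parsing does not handle.

```python
import io
import struct
import wave

# Write one second of 16-bit mono silence at 8000 Hz into memory.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as writer:
    writer.setnchannels(1)
    writer.setsampwidth(2)      # bytes per sample
    writer.setframerate(8000)   # samples per second
    writer.writeframes(b"\x00\x00" * 8000)

header = buffer.getvalue()[:44]

# RIFF container header: chunk id, chunk size, form type.
riff_id, riff_size, form_type = struct.unpack("<4sI4s", header[:12])

# "fmt " chunk: the metadata a player needs before it can decode samples.
(fmt_id, fmt_size, format_tag, channels,
 sample_rate, byte_rate, block_align, bits_per_sample) = struct.unpack(
    "<4sIHHIIHH", header[12:36]
)

# "data" chunk header: id and length of the sample data that follows.
data_id, data_size = struct.unpack("<4sI", header[36:44])

print(riff_id, form_type, channels, sample_rate, bits_per_sample, data_size)
```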
=== Program metadata ===
Metadata is casually used to describe the controlling data used in software architectures that are more abstract or configurable. Most '''[[executable|executable file]]''' formats include what may be termed "metadata" that specifies certain, usually configurable, behavioral [[runtime]] characteristics. However, it is difficult if not impossible to precisely distinguish program "metadata" from general aspects of [[Von Neumann architecture|stored-program computing architecture]]; if the machine reads it and acts upon it, it is a computational [[Instruction (computer science)|instruction]], and the prefix "meta" has little significance.
In [[Java (programming language)|Java]], the [[Class (file format)|class file format]] contains metadata used by the [[Java compiler]] and the [[Java virtual machine]] to [[dynamic linking|dynamically link]] [[class (computer science)|classes]] and to support [[reflection (computer science)|reflection]]. The [[J2SE]] 5.0 version of Java included a [[metadata facility for Java|metadata facility]] to allow additional annotations that are used by [[development tool]]s.
In [[MS-DOS]], the [[COM file]] format does ''not'' include metadata, while the [[EXE]] file and Windows [[Portable Executable|PE]] formats do. These metadata can include the company that published the program, the date the program was created, the version number and more.
In the [[.NET Framework|Microsoft .NET]] executable format, extra metadata is included to allow [[Reflection (computer science)|reflection]] at runtime.
=== Existing software metadata ===
[[Object Management Group]] (OMG) has defined a metadata format for representing entire existing applications for the purposes of [[software mining]], [[software modernization]] and software assurance. This specification, called the OMG [[Knowledge Discovery Metamodel]] (KDM), is the OMG's foundation for "modeling in reverse". KDM is a common language-independent intermediate representation that provides an integrated view of an entire enterprise application, including its behavior (program flow), data, and structure. One of the applications of KDM is Business Rules Mining. [[Knowledge Discovery Metamodel]] includes a fine-grained low-level representation (called "micro KDM"), suitable for performing static analysis of programs.
=== Document metadata ===
Most programs that create documents, including Microsoft [[SharePoint]], [[Microsoft Office Word|Microsoft Word]] and other [[Microsoft Office]] products, save metadata with the document files. These metadata can contain the name of the person who created the file (obtained from the operating system), the name of the person who last edited the file, how many times the file has been printed, and even how many revisions have been made on the file. Other saved material, such as deleted text (saved in case of an undelete command), document comments and the like, is also commonly referred to as "metadata", and the inadvertent inclusion of this material in distributed files has sometimes led to undesirable disclosures.
Document metadata is particularly important in legal environments, where litigation can request this sensitive information, which can include many elements of private and potentially damaging data. Such data has been linked to multiple lawsuits that have caused legal complications for corporations.
Many law firms today use "Metadata Management Software", also known as "Metadata Removal Tools". This software can be used to clean documents before they are sent outside the firm. This process, known as metadata management, protects law firms from potentially unsafe leaking of sensitive data through [[Electronic Discovery]].
For a list of executable formats, see [[object file]].
=== Metamodels ===
Metadata on Models are called [[Metamodel]]s. In [[Model Driven Engineering]], a [[Model (abstract)|Model]] has to conform to a given [[Metamodel]]. According to the [[model-driven architecture|MDA]] guide, a metamodel is a model and each model conforms to a given metamodel. [[Meta-modeling]] allows strict and agile automatic processing of models and metamodels.
The [[Object Management Group]] (OMG) defines four layers of meta-modeling. Each layer is defined and validated by the layer above it:
*M0: instance object, data row, record -> "John Smith"
* M1: model, schema -> "Customer" UML Class or database Table
* M2: metamodel -> [[Unified Modeling Language]] (UML), [[Common Warehouse Metamodel]] (CWM), [[Knowledge Discovery Metamodel]] (KDM)
* M3: meta-metamodel -> [[Meta-Object Facility]] (MOF)
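A loose analogy to these layers can be found in the object model of a language such as Python, as the following sketch shows. It is only an illustration of the layering idea and is not part of any OMG specification: an object plays the role of M0, its class the role of M1, the built-in <code>type</code> (the class of classes) the role of M2, and the fact that <code>type</code> is its own type resembles a self-describing top layer. The <code>Customer</code> class is hypothetical.

```python
class Customer:
    """Hypothetical M1-style model element: describes what a customer record holds."""

    def __init__(self, name):
        self.name = name


m0 = Customer("John Smith")  # instance data, comparable to the M0 layer
m1 = type(m0)                # the class describing the instance (M1 analogue)
m2 = type(m1)                # the metaclass describing classes (M2 analogue)
m3 = type(m2)                # the metaclass of the metaclass: "type" describes itself

print(m0.name, m1.__name__, m2.__name__, m3.__name__)
```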
=== Meta-metadata ===
Since metadata are also data, it is possible to have metadata of metadata, or "meta-metadata". Machine-generated meta-metadata, such as the inverted index created by a free-text search engine, is generally not considered metadata, though.
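The index that a free-text search engine builds, commonly called an inverted index, maps each word to the documents that contain it. A minimal Python sketch, using three hypothetical documents and plain whitespace tokenization (real search engines use far more elaborate processing), might look as follows:

```python
from collections import defaultdict

# Hypothetical document collection keyed by document id.
documents = {
    1: "metadata describes data",
    2: "data about data",
    3: "search engines index text",
}

# Build the index: each word maps to the set of documents containing it.
index = defaultdict(set)
for doc_id, text in documents.items():
    for word in text.lower().split():
        index[word].add(doc_id)

# Looking up a word returns the matching documents without rescanning the text.
print(sorted(index["data"]))
print(sorted(index["metadata"]))
```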
=== Digital library metadata ===
There are three categories of metadata that are frequently used to describe objects in a digital library:
# '''descriptive''' - Information describing the intellectual content of the object, such as [[MARC]] cataloguing records, finding aids or similar schemes. It is typically used for bibliographic purposes and for search and retrieval.
# '''structural''' - Information that ties each object to others to make up logical units (e.g., information that relates individual images of pages from a book to the others that make up the book).
# '''administrative''' - Information used to manage the object or control access to it. This may include information on how it was scanned, its storage format, [[copyright]] and licensing information, and information necessary for the [[digital preservation|long-term preservation]] of the digital objects.
=== Geospatial metadata ===
Metadata that describe geographic objects (such as datasets, maps, features, or simply documents with a geospatial component) have a history going back to at least 1994 (see the [http://libraries.mit.edu/guides/subjects/metadata/standards/fgdc.html MIT Library page on FGDC Metadata]). This class of metadata is described more fully on the [[Geospatial metadata]] page.
Microsoft Windows
'''Microsoft Windows''' is a series of [[software]] [[operating system]]s produced by [[Microsoft]]. Microsoft first introduced an operating environment named ''Windows'' in November 1985 as an add-on to [[MS-DOS]] in response to the growing interest in [[graphical user interface]]s (GUIs). Microsoft Windows came to [[Market dominance|dominate]] the world's [[personal computer]] market, overtaking [[Mac OS]], which had been introduced previously. At the 2004 [[International Data Corporation|IDC]] Directions conference, it was stated that Windows had approximately 90% of the [[Client (computing)|client]] operating system market. The most recent client version of Windows is [[Windows Vista]]; the current [[Server (computing)|server]] version is [[Windows Server 2008]].
==Versions==
The term ''Windows'' collectively describes any or all of several generations of Microsoft (MS) operating system (OS) products. These products are generally categorized as follows:
===16-bit operating environments===
The early versions of Windows were often thought of as just graphical user interfaces, mostly because they ran on top of [[MS-DOS]] and used it for [[file system]] services. However, even the earliest 16-bit Windows versions already assumed many typical operating system functions, notably, having their own [[executable file format]] and providing their own [[device driver]]s (timer, graphics, printer, mouse, keyboard and sound) for applications. Unlike [[MS-DOS]], Windows allowed users to execute multiple graphical applications at the same time, through [[computer multitasking|cooperative multitasking]]. Finally, Windows implemented an elaborate, segment-based, software virtual memory scheme, which allowed it to run applications larger than available memory: code segments and [[resource (Windows)|resource]]s were swapped in and thrown away when memory became scarce, and data segments moved in memory when a given application had relinquished processor control, typically waiting for user input. 16-bit Windows versions include [[Windows 1.0]] (1985), [[Windows 2.0]] (1987) and its close relatives, ''[[Windows 2.1x|Windows/286-Windows/386]]''.
===Hybrid 16/32-bit operating environments===
[[Windows 2.1x|Windows/386]] introduced a 32-bit [[protected mode]] [[kernel (computer science)|kernel]] and [[virtual machine]] monitor. For the duration of a Windows session, it created one or more [[virtual 8086 mode|virtual 8086 environments]] and provided device virtualization for the video card, keyboard, mouse, timer and [[interrupt]] controller inside each of them. The user-visible consequence was that it became possible to preemptively multitask multiple MS-DOS environments in separate windows, although graphical MS-DOS applications required full screen mode. Also, Windows applications were multi-tasked cooperatively inside one such virtual 8086 environment. [[Windows 3.0]] (1990) and [[Windows 3.1x|Windows 3.1]] (1992) improved the design, mostly because of [[virtual memory]] and loadable virtual device drivers ([[VxD]]s) which allowed them to share arbitrary devices between multitasked DOS windows. Also, Windows applications could now run in protected mode (when Windows was running in Standard or 386 Enhanced Mode), which gave them access to several megabytes of memory and removed the obligation to participate in the software virtual memory scheme. They still ran inside the same address space, where the segmented memory provided a degree of protection, and multi-tasked cooperatively. For Windows 3.0, Microsoft also rewrote critical operations from [[C (programming language)|C]] into [[Assembly language|assembly]], making this release faster and less memory-hungry than its predecessors.
===Hybrid 16/32-bit operating systems===
With the introduction of the [[32-bit]] [[Windows 3.1x|Windows for Workgroups 3.11]], Windows was able to stop relying on DOS for file management. Leveraging this, [[Windows 95]] introduced [[Long filename|Long File Names]], reducing the [[8.3 filename]] DOS environment to the role of a [[boot loader]]. MS-DOS was now bundled with Windows; this notably made it (partially) aware of long file names when its utilities were run from within Windows. The most important novelty was the possibility of running 32-bit multi-threaded preemptively multitasked graphical programs. However, the necessity of keeping compatibility with 16-bit programs meant the GUI components were still 16-bit only and not fully reentrant, which resulted in reduced performance and stability.
There were three releases of Windows 95 (the first in 1995, then subsequent bug-fix versions in 1996 and 1997, only released to OEMs, which added extra features such as [[File Allocation Table|FAT32]] and primitive USB support). Microsoft's next OS was [[Windows 98]]; there were two versions of this (the first in 1998 and the second, named "Windows 98 Second Edition", in 1999). In 2000, Microsoft released [[Windows Me]] (''Me'' standing for ''Millennium Edition''), which used the same core as Windows 98 but adopted some aspects of Windows 2000 and removed the option to boot into DOS mode. It also added a new feature called System Restore, allowing the user to set the computer's settings back to an earlier date.
===32-bit operating systems===
The NT family of Windows systems was fashioned and marketed for higher-reliability business use, and was unencumbered by any Microsoft DOS patrimony. The first release was [[Windows NT 3.1]] (1993, numbered "3.1" to match the Windows version and to one-up [[OS/2]] 2.1, IBM's flagship OS, which had been co-developed by Microsoft and was Windows NT's main competitor at the time), which was followed by [[Windows NT 3.5|NT 3.5]] (1994), [[Windows NT 3.51|NT 3.51]] (1995), [[Windows NT 4.0|NT 4.0]] (1996), and [[Windows 2000]] (essentially NT 5.0). NT 4.0 was the first in this line to implement the "Windows 95" user interface (and the first to include Windows 95's built-in 32-bit runtimes). Microsoft then moved to combine their consumer and business operating systems. [[Windows XP]], coming in both home and professional versions (and later niche market versions for [[tablet PC]]s and [[media center]]s), improved stability, user experience and backwards compatibility. Then, [[Windows Server 2003]] brought [[Windows Server]] up to date with Windows XP. Since then, a new version, [[Windows Vista]], has been released, and [[Windows Server 2008]], released on [[February 27]], [[2008]], brings [[Windows Server]] up to date with [[Windows Vista]]. [[Windows CE]], Microsoft's offering in the mobile and embedded markets, is also a true 32-bit operating system that offers various services for all sub-operating workstations.
===64-bit operating systems===
[[Windows NT]] included support for several different platforms before the [[X86 architecture|x86]]-based [[personal computer]] became dominant in the professional world. Versions of NT from 3.1 to 4.0 variously supported [[PowerPC]], [[DEC Alpha]] and [[MIPS Technologies|MIPS]] R4000, some of which were 64-bit processors, although the operating system treated them as 32-bit processors.
With the introduction of the [[Intel]] [[Itanium]] architecture, which is referred to as [[IA-64]], Microsoft released new versions of Windows to support it. Itanium versions of [[Windows XP]] and [[Windows Server 2003]] were released at the same time as their mainstream x86 (32-bit) counterparts. On [[April 25]] [[2005]], Microsoft released [[Windows XP Professional x64 Edition]] and x64 versions of Windows Server 2003 to support the [[x86-64|AMD64/Intel64]] (or ''x64'' in Microsoft terminology) architecture. Microsoft dropped support for the Itanium version of Windows XP in 2005. [[Windows Vista]] is the first end-user version of Windows that Microsoft has released simultaneously in 32-bit and x64 editions. Windows Vista does not support the Itanium architecture. The modern 64-bit Windows family comprises AMD64/Intel64 versions of [[Windows Vista]], and [[Windows Server 2003]] and [[Windows Server 2008]], in both Itanium and x64 editions.
==History==
Microsoft has taken two parallel routes in its operating systems. One route has been for the home user and the other has been for the professional IT user. The dual routes have generally led to home versions having greater [[multimedia]] support and less functionality in networking and security, and professional versions having inferior multimedia support and better networking and security.
The first version of Microsoft Windows, [[Windows 1.0|version 1.0]], released in November 1985, lacked a degree of functionality and achieved little popularity; it was intended to compete with Apple's own operating system. Windows 1.0 is not a complete operating system; rather, it extends MS-DOS. Microsoft Windows version 2.0 was released in November 1987 and was slightly more popular than its predecessor. Windows 2.03 (released in January 1988) changed the OS from tiled windows to overlapping windows. This change led to Apple Computer filing a suit against Microsoft alleging infringement on Apple's copyrights.
Microsoft Windows version 3.0, released in 1990, was the first Microsoft Windows version to achieve broad commercial success, selling 2 million copies in the first six months.[http://www.islandnet.com/~kpolsson/compsoft/soft1991.htm][http://www.thocp.net/companies/microsoft/microsoft_company.htm] It featured improvements to the user interface and to multitasking capabilities. It received a facelift in Windows 3.1, made generally available on [[March 1]], [[1992]]. Windows 3.1 support ended on [[December 31]], [[2001]].
In July 1993, Microsoft released [[Windows NT]] based on a new kernel. NT was considered to be the professional OS and was the first Windows version to utilize [[preemptive multitasking]]. Windows NT would later be retooled to also function as a home operating system, with Windows XP.
On August 24, 1995, Microsoft released [[Windows 95]], a new and major consumer version that made further changes to the user interface, and also used [[preemptive multitasking]]. Windows 95 was designed to replace not only Windows 3.1, but also Windows for Workgroups, and MS-DOS. It was also the first Windows operating system to use Plug and Play capabilities. The changes Windows 95 brought to the desktop were revolutionary, as opposed to evolutionary, such as those in Windows 98 and Windows Me. Mainstream support for [[Windows 95]] ended on [[December 31]], [[2000]] and extended support for [[Windows 95]] ended on [[December 31]], [[2001]].
The next in the consumer line was Microsoft [[Windows 98]] released on June 25th, 1998. It was substantially criticized for its slowness and for its unreliability compared with [[Windows 95]], but many of its basic problems were later rectified with the release of [[Windows 98]] Second Edition in 1999. Mainstream support for [[Windows 98]] ended on [[June 30]], [[2002]] and extended support for [[Windows 98]] ended on [[July 11]], [[2006]].
As part of its "professional" line, Microsoft released [[Windows 2000]] in February 2000. The consumer version following Windows 98 was [[Windows Me]] (Windows Millennium Edition). Released in September 2000, [[Windows Me]] implemented a number of new technologies for Microsoft; the most publicized was "[[Universal Plug and Play]]."
In October 2001, Microsoft released [[Windows XP]], a version built on the Windows NT [[Kernel (computer science)|kernel]] that also retained the consumer-oriented usability of Windows 95 and its successors. This new version was widely praised in computer magazines. It shipped in two distinct editions, "Home" and "Professional", the former lacking many of the superior security and networking features of the Professional edition. Additionally, the first "Media Center" edition was released in 2002, with an emphasis on support for DVD and TV functionality including program recording and a remote control. Mainstream support for [[Windows XP]] will continue until [[April 14]], [[2009]] and extended support will continue until [[April 8]], [[2014]].
In April 2003, [[Windows Server 2003]] was introduced, replacing the [[Windows 2000]] line of server products with a number of new features and a strong focus on security; this was followed in December 2005 by Windows Server 2003 R2.
On [[January 30]], [[2007]] Microsoft released [[Windows Vista]]. It contains a number of [[Features new to Windows Vista|new features]], from a redesigned shell and user interface to significant [[Technical features new to Windows Vista|technical changes]], with a particular focus on [[Security and safety features new to Windows Vista|security features]]. It is available in a number of [[Windows Vista editions and pricing|different editions]], and has been subject to [[Criticism of Windows Vista|some criticism]].
==Timeline of releases==
==Security==
[[Computer security|Security]] has been a hot topic with Windows for many years, and even Microsoft itself has been the victim of security breaches. Consumer versions of Windows were originally designed for ease-of-use on a single-user PC without a network connection, and did not have security features built in from the outset. [[Windows NT]] and its successors are designed for security (including on a network) and multi-user PCs, but were not originally designed with Internet security in mind to the same degree, since Internet use was less prevalent when Windows NT was first developed in the early 1990s. These design issues, combined with flawed code (such as [[buffer overflow]]s) and the popularity of Windows, mean that it is a frequent target of [[computer worm|worm]] and [[computer virus|virus]] writers. In June 2005, [[Bruce Schneier]]'s ''Counterpane Internet Security'' reported that it had seen over 1,000 new viruses and worms in the previous six months.
Microsoft releases security patches through its [[Windows Update]] service approximately once a month (usually the second Tuesday of the month), although critical updates are made available at shorter intervals when necessary. In Windows 2000 (SP3 and later), Windows XP and Windows Server 2003, updates can be automatically downloaded and installed if the user chooses to do so. As a result, Service Pack 2 for Windows XP, as well as Service Pack 1 for Windows Server 2003, were installed by users more quickly than they otherwise might have been.
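The "second Tuesday of the month" schedule mentioned above is easy to compute. The following is a minimal sketch in Python; the function name <code>second_tuesday</code> is hypothetical and is not part of any Microsoft tool or API.

```python
import datetime


def second_tuesday(year: int, month: int) -> datetime.date:
    """Return the date of the second Tuesday of the given month."""
    first = datetime.date(year, month, 1)
    # date.weekday() numbers Monday as 0, so Tuesday is 1.
    days_until_tuesday = (1 - first.weekday()) % 7
    # Step to the first Tuesday, then add one more week.
    return first + datetime.timedelta(days=days_until_tuesday + 7)


if __name__ == "__main__":
    for month in range(1, 4):
        print(second_tuesday(2007, month).isoformat())
```

Because the first Tuesday always falls on the 1st to the 7th, the result always lies between the 8th and the 14th of the month.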
===Windows Defender===
On [[6 January]] [[2005]], Microsoft released a beta version of Microsoft AntiSpyware, based upon the previously released [[GIANT Company Software|Giant]] AntiSpyware. On [[14 February]], [[2006]], Microsoft AntiSpyware became [[Windows Defender]] with the release of beta 2. Windows Defender is a freeware program designed to protect against spyware and other unwanted software. [[Windows XP]] and [[Windows Server 2003]] users who have [[Windows Genuine Advantage|genuine]] copies of Microsoft Windows can freely download the program from Microsoft's web site, and Windows Defender ships as part of [[Windows Vista]].
===Third-party analysis===
In an article based on a report by Symantec, internetnews.com has described Microsoft Windows as having the "fewest number of patches and the shortest average patch development time of the five operating systems it monitored in the last six months of 2006." The number of vulnerabilities found in Windows has also significantly increased (Windows: 12+, Red Hat + Fedora: 2, Mac OS X: 1, HP-UX: 2, Solaris: 1).
A study conducted by [[Kevin Mitnick]] and marketing communications firm Avantgarde in 2004 found that an unprotected and unpatched Windows XP system with Service Pack 1 lasted only 4 minutes on the Internet before it was compromised, and an unprotected and also unpatched [[Windows Server 2003]] system was compromised after being connected to the Internet for 8 hours. This result does not apply to Windows XP systems running the Service Pack 2 update (released in late 2004), which vastly improved the security of Windows XP: the computer in the study that was running Windows XP Service Pack 2 was not compromised. The [[AOL]] National Cyber Security Alliance Online Safety Study of October 2004 determined that 80% of Windows users were infected by at least one [[spyware]]/[[adware]] product. Much documentation is available describing how to increase the security of Microsoft Windows products. Typical suggestions include deploying Microsoft Windows behind a hardware or software [[firewall]], running [[anti-virus]] and [[anti-spyware]] software, and installing patches as they become available through [[Windows Update]].
==Windows Lifecycle Policy==
Microsoft has stopped releasing updates and hotfixes for many old Windows operating systems, including all versions of Windows 9x and earlier versions of Windows NT. Windows versions prior to [[Windows XP|XP]] are no longer supported, with the exception of [[Windows 2000]], which is currently in the Extended Support Period; that period will end on [[July 13]], [[2010]]. Windows XP versions prior to SP2 are no longer supported either. Support for [[Windows XP 64-bit Edition]] also ended after the release of the more recent [[Windows XP Professional x64 Edition]]. No new updates are created for unsupported versions of Windows.
==Emulation software==
Emulation allows the use of some Windows applications without using Microsoft Windows. These include:
* [[Wine (software)|Wine]] - a [[free and open source software]] implementation of the [[Windows API]], allowing one to run many Windows applications on x86-based platforms, including [[Linux]]. Wine is technically not an emulator but a "compatibility layer"; while an emulator effectively 'pretends' to be a different CPU, Wine instead makes use of Windows-style APIs to 'simulate' the Windows environment directly.
** [[CrossOver]] - A Wine package with licensed fonts. Its developers are regular contributors to Wine, and focus on Wine running officially supported applications.
** [[Cedega]] - [[TransGaming Technologies]]' proprietary [[Fork (software development)|fork]] of Wine, designed specifically for running games written for Microsoft Windows under Linux.
** [[Darwine]] - This project intends to port and develop Wine as well as other supporting tools that will allow [[Darwin (operating system)|Darwin]] and [[Mac OS X]] users to run Microsoft Windows applications, and to provide [[Win32]] [[Application Programming Interface|API]] compatibility at application source code level.
* [[ReactOS]] - An open-source OS that is intended to run the same software as Windows, originally designed to imitate Windows NT 4.0, now aiming at Windows XP compatibility. It has been in the [[development stage]] since 1996.
Morphology (linguistics)
'''Morphology''' is the field of [[linguistics]] that studies the internal structure of words. (Words as units in the lexicon are the subject matter of [[lexicology]].) While words are generally accepted as being (with [[clitic]]s) the smallest units of [[syntax]], it is clear that in most (if not all) languages, words can be related to other words by rules. For example, [[English language|English]] speakers recognize that the words ''dog'', ''dogs'', and ''dog-catcher'' are closely related. English speakers recognize these relations from their tacit knowledge of the rules of word-formation in English. They intuit that ''dog'' is to ''dogs'' as ''cat'' is to ''cats''; similarly, ''dog'' is to ''dog-catcher'' as ''dish'' is to ''dishwasher''. The rules understood by the speaker reflect specific patterns (or regularities) in the way words are formed from smaller units and how those smaller units interact in speech. In this way, morphology is the branch of linguistics that studies patterns of word-formation within and across languages, and attempts to formulate rules that model the knowledge of the speakers of those languages.
==History ==
The history of morphological analysis dates back to the [[ancient India]]n linguist Pāṇini, who formulated the 3,959 rules of [[Sanskrit]] morphology in the text ''Aṣṭādhyāyī'' by using a Constituency Grammar. The Graeco-Roman grammatical tradition also engaged in morphological analysis.
The term ''morphology'' was coined by [[August Schleicher]] in [[1859]].
== Fundamental concepts ==
=== Lexemes and word forms ===
The distinction between these two senses of "word" is arguably the most important one in morphology. The first sense of "word," the one in which ''dog'' and ''dogs'' are "the same word," is called '''[[lexeme]]'''. The second sense is called '''word-form'''. We thus say that ''dog'' and ''dogs'' are different forms of the same lexeme. ''Dog'' and ''dog-catcher'', on the other hand, are different lexemes; for example, they refer to two different kinds of entities. The form of a word that is chosen conventionally to represent the canonical form of a word is called a [[lemma (linguistics)|lemma]], or '''citation form'''.
==== Prosodic word vs. morphological word ====
Here are examples from other languages of the failure of a single phonological word to coincide with a single morphological word-form. In Latin, one way to express the concept of 'NOUN-PHRASE1 and NOUN-PHRASE2' (as in "apples and oranges") is to suffix '-que' to the second noun phrase: "apples oranges-and", as it were. An extreme level of this theoretical quandary posed by some phonological words is provided by the Kwak'wala language. In Kwak'wala, as in a great many other languages, meaning relations between nouns, including possession and "semantic case", are formulated by affixes instead of by independent "words". The three-word English phrase, "with his club", where 'with' identifies its dependent noun phrase as an instrument and 'his' denotes a possession relation, would consist of two words or even just one word in many languages. But affixation for semantic relations in Kwak'wala differs dramatically (from the viewpoint of those whose language is not Kwak'wala) from such affixation in other languages for this reason: the affixes phonologically attach not to the lexeme they pertain to semantically, but to the ''preceding'' lexeme. Consider the following example (in Kwak'wala, sentences begin with what corresponds to an English verb):
kwixʔid-i-da bəgwanəma<sub>i</sub>-χ-a q'asa-s-is<sub>i</sub> t'alwagwayu
Morpheme by morpheme translation:
kwixʔid-i-da = clubbed-PIVOT-DETERMINER
bəgwanəma-χ-a = man-ACCUSATIVE-DETERMINER
q'asa-s-is = otter-INSTRUMENTAL-3.PERSON.SINGULAR-POSSESSIVE
t'alwagwayu = club.
"the man clubbed the otter with his club"
(Notation notes:
1. accusative case marks an entity that something is done to.
2. determiners are words such as "the", "this", "that".
3. the concept of "pivot" is a theoretical construct that is not relevant to this discussion.)
That is, to the speaker of Kwak'wala, the sentence does not contain the "words" 'him-the-otter' or 'with-his-club'. Instead, the markers -''i-da'' (PIVOT-'the'), referring to ''man'', attach not to ''bəgwanəma'' ('man'), but instead to the "verb"; the markers -''χ-a'' (ACCUSATIVE-'the'), referring to ''otter'', attach to ''bəgwanəma'' instead of to ''q'asa'' ('otter'), etc. To summarize differently: a speaker of Kwak'wala does ''not'' perceive the sentence to consist of these phonological words:
kwixʔid i-da-bəgwanəma χ-a-q'asa s-is<sub>i</sub>-t'alwagwayu
"clubbed PIVOT-the-man<sub>i</sub> hit-the-otter with-his<sub>i</sub>-club"
A central publication on this topic is the recent volume edited by Dixon and Aikhenvald (2007), examining the mismatch between prosodic-phonological and grammatical definitions of "word" in various Amazonian, Australian Aboriginal, Caucasian, Eskimo, Indo-European, Native North American, and West African languages, and in sign languages. Apparently, a wide variety of languages make use of the hybrid linguistic unit clitic, possessing the grammatical features of independent words but the prosodic-phonological lack of freedom of bound morphemes. The intermediate status of clitics poses a considerable challenge to linguistic theory.
=== Inflection vs. word-formation ===
Given the notion of a lexeme, it is possible to distinguish two kinds of morphological rules. Some morphological rules relate different forms of the same lexeme, while other rules relate different lexemes to each other. Rules of the first kind are called '''[[Inflection|inflectional rules]]''', while those of the second kind are called '''[[word formation|word-formation]]''' rules. The English plural, as illustrated by ''dog'' and ''dogs'', is an inflectional rule; compounds like ''dog-catcher'' or ''dishwasher'' provide an example of a word-formation rule. Informally, word-formation rules form "new words" (that is, new lexemes), while inflection rules yield variant forms of the "same" word (lexeme).
There is a further distinction between two kinds of word-formation: [[Derivation (linguistics)|derivation]] and [[Compound (linguistics)|compounding]]. Compounding is a process of word-formation that involves combining complete word-forms into a single '''compound''' form; ''dog-catcher'' is therefore a compound, because both ''dog'' and ''catcher'' are complete word-forms in their own right before the compounding process has been applied, and are subsequently treated as one form. Derivation involves [[affix]]ing [[bound morpheme|bound]] (non-independent) forms to existing lexemes, whereby the addition of the affix '''derives''' a new lexeme. One example of derivation is clear in this case: the word ''independent'' is derived from the word ''dependent'' by prefixing it with the derivational prefix ''in-'', while ''dependent'' itself is derived from the verb ''depend''.
The distinction between inflection and word-formation is not at all clear-cut. There are many examples where linguists fail to agree whether a given rule is inflection or word-formation. The next section will attempt to clarify this distinction.
=== Paradigms and morphosyntax ===
A '''paradigm''' is the complete set of related word-forms associated with a given lexeme. The familiar examples of paradigms are the [[Grammatical conjugation|conjugations]] of verbs, and the [[declension]]s of nouns. Accordingly, the word-forms of a lexeme may be arranged conveniently into tables, by classifying them according to shared inflectional categories such as [[grammatical tense|tense]], [[grammatical aspect|aspect]], [[grammatical mood|mood]], [[grammatical number|number]], [[grammatical gender|gender]] or [[grammatical case|case]]. For example, the personal pronouns in English can be organized into tables, using the categories of person (1st., 2nd., 3rd.), number (singular vs. plural), gender (masculine, feminine, neuter), and [[grammatical case|case]] (subjective, objective, and possessive). See [[English personal pronouns]] for the details.
The inflectional categories used to group word-forms into paradigms cannot be chosen arbitrarily; they must be categories that are relevant to stating the [[syntax|syntactic rules]] of the language. For example, person and number are categories that can be used to define paradigms in English, because English has [[Agreement (linguistics)|grammatical agreement]] rules that require the verb in a sentence to appear in an inflectional form that matches the person and number of the subject. In other words, the syntactic rules of English care about the difference between ''dog'' and ''dogs'', because the choice between these two forms determines which form of the verb is to be used. In contrast, however, no syntactic rule of English cares about the difference between ''dog'' and ''dog-catcher'', or ''dependent'' and ''independent''. The first two are just nouns, and the second two just adjectives, and they generally behave like any other noun or adjective behaves.
An important difference between inflection and word-formation is that inflected word-forms of lexemes are organized into paradigms, which are defined by the requirements of syntactic rules, whereas the rules of word-formation are not restricted by any corresponding requirements of syntax. Inflection is therefore said to be relevant to syntax, and word-formation is not. The part of morphology that covers the relationship between [[syntax]] and morphology is called morphosyntax, and it concerns itself with inflection and paradigms, but not with word-formation or compounding.
=== Allomorphy ===
In the exposition above, morphological rules are described as analogies between word-forms: ''dog'' is to ''dogs'' as ''cat'' is to ''cats'', and as ''dish'' is to ''dishes''. In this case, the analogy applies both to the form of the words and to their meaning: in each pair, the first word means "one of X", while the second "two or more of X", and the difference is always the plural form ''-s'' affixed to the second word, signaling the key distinction between singular and plural entities.
One of the largest sources of complexity in morphology is that this one-to-one correspondence between meaning and form does not hold in every case in the language. In English, we have word form pairs like ''ox/oxen'', ''goose/geese'', and ''sheep/sheep'', where the difference between the singular and the plural is signaled in a way that departs from the regular pattern, or is not signaled at all. Even cases considered "regular", with the final ''-s'', are not so simple; the ''-s'' in ''dogs'' is not pronounced the same way as the ''-s'' in ''cats'', and in a plural like ''dishes'', an "extra" vowel appears before the ''-s''. These cases, where the same distinction is effected by alternative forms of a "word", are called '''[[allomorph]]y'''.
Phonological rules constrain which sounds can appear next to each other in a language, and morphological rules, when applied blindly, would often violate phonological rules, by resulting in sound sequences that are prohibited in the language in question. For example, to form the plural of ''dish'' by simply appending an ''-s'' to the end of the word would result in the form *{{IPA|[dɪʃs]}}, which is not permitted by the [[phonotactics]] of English. In order to "rescue" the word, a vowel sound is inserted between the root and the plural marker, and {{IPA|[dɪʃəz]}} results. Similar rules apply to the pronunciation of the ''-s'' in ''dogs'' and ''cats'': it depends on the quality (voiced vs. unvoiced) of the final preceding [[phoneme]].
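The choice among the three pronunciations of the regular English plural can be sketched as a small function of the final phoneme of the stem. This is a minimal illustration, not a full phonological model: the phoneme inventories below are simplified assumptions, stems are given as lists of IPA symbols, and the names <code>plural_allomorph</code> and <code>pluralize</code> are hypothetical.

```python
from typing import List

# Simplified, assumed inventories of stem-final phonemes.
SIBILANTS = {"s", "z", "ʃ", "ʒ", "tʃ", "dʒ"}
VOICELESS_NON_SIBILANTS = {"p", "t", "k", "f", "θ"}


def plural_allomorph(final_phoneme: str) -> str:
    """Choose the plural allomorph from the stem's final phoneme."""
    if final_phoneme in SIBILANTS:
        # A vowel is inserted to avoid a prohibited sibilant cluster.
        return "əz"
    if final_phoneme in VOICELESS_NON_SIBILANTS:
        return "s"
    # Voiced consonants and vowels take the voiced allomorph.
    return "z"


def pluralize(phonemes: List[str]) -> List[str]:
    """Append the appropriate plural allomorph to a phoneme sequence."""
    return phonemes + [plural_allomorph(phonemes[-1])]


if __name__ == "__main__":
    for stem in (["d", "ɪ", "ʃ"], ["k", "æ", "t"], ["d", "ɒ", "ɡ"]):
        print("".join(pluralize(stem)))
```

The order of the checks matters: sibilants are tested first because some of them are voiceless and would otherwise receive plain ''-s''.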
=== Lexical morphology ===
[[Lexical morphology]] is the branch of morphology that deals with the [[lexicon]], which, morphologically conceived, is the collection of [[lexeme]]s in a language. As such, it concerns itself primarily with word-formation: derivation and compounding.
== Models of morphology ==
There are three principal approaches to morphology, each of which tries to capture the distinctions above in a different way. These are:
* [[Morpheme-based morphology]], which makes use of an [[Item-and-Arrangment (Morphology)|Item-and-Arrangement]] approach.
* [[Lexeme-based morphology]], which normally makes use of an [[Item-and-Process (Morphology)|Item-and-Process]] approach.
* [[Word-based morphology]], which normally makes use of a [[Word-and-paradigm morphology|Word-and-Paradigm]] approach.
Note that while the associations indicated between the concepts in each item in that list are very strong, they are not absolute.
=== Morpheme-based morphology ===
In [[morpheme-based morphology]], word-forms are analyzed as arrangements of [[morpheme]]s. A '''morpheme''' is defined as the minimal meaningful unit of a language. In a word like ''independently'', we say that the morphemes are ''in-'', ''depend'', ''-ent'', and ''-ly''; ''depend'' is the [[root (linguistics)|root]] and the other morphemes are, in this case, derivational affixes. In a word like ''dogs'', we say that ''dog'' is the root, and that ''-s'' is an inflectional morpheme. This way of analyzing word-forms as if they were made of morphemes put after each other like beads on a string is called [[Item-and-Arrangment (Morphology)|Item-and-Arrangement]].
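The "beads on a string" view can be made concrete with a small sketch in which a word-form is simply an ordered list of labeled morphemes. The representation and the function names <code>spell_out</code> and <code>root_of</code> are assumptions made for illustration; the morpheme labels follow the analysis given above.

```python
from typing import List, Tuple

Morpheme = Tuple[str, str]  # (form, role)

INDEPENDENTLY: List[Morpheme] = [
    ("in", "derivational prefix"),
    ("depend", "root"),
    ("ent", "derivational suffix"),
    ("ly", "derivational suffix"),
]

DOGS: List[Morpheme] = [
    ("dog", "root"),
    ("s", "inflectional suffix"),
]


def spell_out(morphemes: List[Morpheme]) -> str:
    """Concatenate the morphemes in order to obtain the word-form."""
    return "".join(form for form, _role in morphemes)


def root_of(morphemes: List[Morpheme]) -> str:
    """Return the form of the morpheme labeled as the root."""
    return next(form for form, role in morphemes if role == "root")


if __name__ == "__main__":
    print(spell_out(INDEPENDENTLY), root_of(INDEPENDENTLY))
    print(spell_out(DOGS), root_of(DOGS))
```

In this representation the word-form is nothing more than the arrangement of its items, which is why forms such as ''geese'' or ''sheep'' are awkward to describe in it.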
The morpheme-based approach is the first one that beginners to morphology usually think of, and which laymen tend to find the most obvious. This is so to such an extent that very often beginners think that morphemes are an inevitable, fundamental notion of morphology, and many five-minute explanations of morphology are, in fact, five-minute explanations of morpheme-based morphology. This is, however, not so. The fundamental idea of morphology is that the words of a language are related to each other by different kinds of rules. Analyzing words as sequences of morphemes is a way of describing these relations, but is not the only way. In actual academic linguistics, morpheme-based morphology certainly has many adherents, but is by no means the dominant approach.
=== Lexeme-based morphology ===
[[Lexeme-based morphology]] is (usually) an [[Item-and-Process (Morphology)|Item-and-Process]] approach. Instead of analyzing a word-form as a set of morphemes arranged in sequence, a word-form is said to be the result of applying rules that ''alter'' a word-form or stem in order to produce a new one. An inflectional rule takes a stem, changes it as is required by the rule, and outputs a word-form; a derivational rule takes a stem, changes it as per its own requirements, and outputs a derived stem; a compounding rule takes word-forms, and similarly outputs a compound stem.
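In an Item-and-Process view, each rule can be pictured as a function from stems to stems or word-forms. The sketch below works on English spelling rather than sound and covers only a few toy patterns; all function names are hypothetical and the rules are deliberately incomplete.

```python
def regular_plural(stem: str) -> str:
    """Inflectional rule: output the plural word-form (regular spelling only)."""
    if stem.endswith(("s", "x", "z", "ch", "sh")):
        return stem + "es"
    return stem + "s"


def vowel_change_plural(stem: str) -> str:
    """Inflectional rule that alters the stem itself (toy version: oo -> ee)."""
    return stem.replace("oo", "ee", 1)


def negate(stem: str) -> str:
    """Derivational rule: prefix in- to derive a new stem."""
    return "in" + stem


def compound(first: str, second: str) -> str:
    """Compounding rule: join two word-forms into a compound stem."""
    return first + "-" + second


if __name__ == "__main__":
    print(regular_plural("dish"))
    print(vowel_change_plural("goose"))
    print(negate("dependent"))
    print(regular_plural(compound("dog", "catcher")))
```

Since a rule is a process rather than a piece of the word, a stem-internal change such as ''goose''/''geese'' is no harder to state than suffixation, and rules compose naturally, as in the last line.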
=== Word-based morphology ===
[[Word-based morphology]] is (usually) a [[Word-and-paradigm morphology|Word-and-paradigm]] approach. This theory takes paradigms as a central notion. Instead of stating rules to combine morphemes into word-forms, or to generate word-forms from stems, word-based morphology states generalizations that hold between the forms of inflectional paradigms. The major point behind this approach is that many such generalizations are hard to state with either of the other approaches. The examples are usually drawn from [[fusional language]]s, where a given "piece" of a word, which a morpheme-based theory would call an inflectional morpheme, corresponds to a combination of grammatical categories, for example, "third person plural." Morpheme-based theories usually have no problems with this situation, since one just says that a given morpheme has two categories. Item-and-Process theories, on the other hand, often break down in cases like these, because they all too often assume that there will be two separate rules here, one for third person, and the other for plural, but the distinction between them turns out to be artificial. Word-and-Paradigm approaches treat these as whole words that are related to each other by [[analogy|analogical]] rules. Words can be categorized based on the pattern they fit into. This applies both to existing words and to new ones. Application of a pattern different from the one that has been used historically can give rise to a new word, such as ''older'' replacing ''elder'' (where ''older'' follows the normal pattern of [[adjective|adjectival]] [[comparative]]s) and ''cows'' replacing ''kine'' (where ''cows'' fits the regular pattern of plural formation). While a Word-and-Paradigm approach can explain this easily, other approaches have difficulty with phenomena such as this.
== Morphological typology ==
In the 19th century, philologists devised a now classic classification of languages according to their morphology. According to this typology, some languages are [[isolating language|isolating]], and have little to no morphology; others are [[agglutinating language|agglutinative]], and their words tend to have lots of easily-separable morphemes; while others yet are inflectional or [[fusional language|fusional]], because their inflectional morphemes are said to be "fused" together. This leads to one bound morpheme conveying multiple pieces of information. The classic example of an isolating language is [[Chinese language|Chinese]]; the classic example of an agglutinative language is [[Turkish language|Turkish]]; both [[Latin language|Latin]] and [[Greek language|Greek]] are classic examples of fusional languages.
Considering the variability of the world's languages, it becomes clear that this classification is not at all clear-cut: many languages do not neatly fit any one of these types, and some fit more than one. A continuum of morphological complexity may be adopted instead when classifying languages.
The three models of morphology stem from attempts to analyze languages that more or less match different categories in this typology. The Item-and-Arrangement approach fits very naturally with agglutinative languages; while the Item-and-Process and Word-and-Paradigm approaches usually address fusional languages.
Note also that the classical typology mostly applies to inflectional morphology; there is very little fusion in word-formation. Languages may be classified as synthetic or analytic in their word formation, depending on the preferred way of expressing notions that are not inflectional: either by using word-formation (synthetic), or by using syntactic phrases (analytic).
Named entity recognition
'''Named entity recognition''' (NER) (also known as '''entity identification (EI)''' and '''entity extraction''') is a subtask of [[information extraction]] that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.
For example, a NER system producing [[Message Understanding Conference|MUC]]-style output might [[Metadata|tag]] the sentence
:''Jim bought 300 shares of Acme Corp. in 2006.''
by marking each recognized entity (shown here in bold):
:'''Jim''' bought '''300''' shares of '''Acme Corp.''' in '''2006'''.
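The following sketch shows how inline markup of this kind could be rendered once the entities are known. It does not perform recognition: the entity list is supplied by hand. The element names <code>ENAMEX</code>, <code>NUMEX</code> and <code>TIMEX</code> are those used in the MUC named entity task, while the <code>TYPE</code> values (in particular <code>QUANTITY</code>) and the function name <code>tag_sentence</code> are assumptions made for this example.

```python
from typing import List, Tuple

SENTENCE = "Jim bought 300 shares of Acme Corp. in 2006."

# (surface text, element name, TYPE attribute), in order of appearance.
# These annotations are assumed; a real system would have to predict them.
ENTITIES: List[Tuple[str, str, str]] = [
    ("Jim", "ENAMEX", "PERSON"),
    ("300", "NUMEX", "QUANTITY"),
    ("Acme Corp.", "ENAMEX", "ORGANIZATION"),
    ("2006", "TIMEX", "DATE"),
]


def tag_sentence(sentence: str, entities: List[Tuple[str, str, str]]) -> str:
    """Wrap each listed entity in an SGML-style element, left to right."""
    pieces = []
    pos = 0
    for text, element, entity_type in entities:
        # Search from the end of the previous entity so order is respected.
        start = sentence.index(text, pos)
        pieces.append(sentence[pos:start])
        pieces.append(f'<{element} TYPE="{entity_type}">{text}</{element}>')
        pos = start + len(text)
    pieces.append(sentence[pos:])
    return "".join(pieces)


if __name__ == "__main__":
    print(tag_sentence(SENTENCE, ENTITIES))
```

Separating the entity list from the rendering step keeps the same annotations usable for other output formats.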
NER systems have been created that use linguistic [[formal grammar|grammar]]-based techniques as well as [[statistical model]]s. Hand-crafted grammar-based systems typically obtain better results, but at the cost of months of work by experienced [[Linguistics|linguists]]. Statistical NER systems typically require a large amount of manually [[annotation|annotated]] training data.
Since about 1998, there has been a great deal of interest in entity identification in the [[molecular biology]], [[bioinformatics]], and medical [[natural language processing]] communities. The most common entity of interest in that domain has been names of genes and gene products.
==Named entity types==
In the expression ''named entity'', the word ''named'' restricts the task to those entities for which one or many [[rigid designator]]s, as defined by [[Saul Kripke|Kripke]], stand for the referent. For instance, the ''automotive company created by Henry Ford in 1903'' is referred to as ''Ford'' or ''Ford Motor Company''. Rigid designators include proper names as well as certain natural kind terms like biological species and substances.
There is a general agreement to include [[temporal expressions]] and some numerical expressions such as money and measures in named entities. While some instances of these types are good examples of rigid designators (e.g., the year 2001) there are also many invalid ones (e.g., I take my vacations in “June”). In the first case, the year ''2001'' refers to the ''2001st year of the Gregorian calendar''. In the second case, the month ''June'' may refer to the month of an undefined year (''past June'', ''next June'', ''June 2020'', etc.). It is arguable that the named entity definition is loosened in such cases for practical reasons.
At least two [[Hierarchy|hierarchies]] of named entity types have been proposed in the literature. The [[BBN Technologies|BBN]] categories [http://www.ldc.upenn.edu/Catalog/docs/LDC2005T33/BBN-Types-Subtypes.html], proposed in 2002, are used for [[Question Answering]] and consist of 29 types and 64 subtypes. Sekine's extended hierarchy [http://nlp.cs.nyu.edu/ene/], proposed in 2002, is made up of 200 subtypes.
==Evaluation==
Benchmarking and evaluations have been performed in the ''[[Message Understanding Conference]]s'' (MUC) organized by [[DARPA]], ''International Conference on Language Resources and Evaluation (LREC)'', ''Computational Natural Language Learning ([[CoNLL]])'' workshops, ''Automatic Content Extraction'' (ACE) organized by [[NIST]], the ''[[Multilingual Entity Task Conference]]'' (MET), ''Information Retrieval and Extraction Exercise'' (IREX) and in ''HAREM'' (Portuguese language only).
[http://aclweb.org/aclwiki/index.php?title=Named_Entity_Recognition_%28State_of_the_art%29 State-of-the-art systems] produce near-human performance. For instance, the best system entering [http://www.itl.nist.gov/iad/894.02/related_projects/muc/proceedings/muc_7_toc.html MUC-7] achieved an [[Information_retrieval#F-measure|F-measure]] of 93.39%, while human annotators scored 97.60% and 96.95%.
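The F-measure combines precision (the fraction of predicted entities that are correct) and recall (the fraction of reference entities that were found). A minimal sketch follows, assuming that an entity counts as correct only when both its text and its type match exactly; evaluations such as MUC use their own scoring rules, and the function name <code>f_measure</code> and the sample data are hypothetical.

```python
from typing import Set, Tuple

Entity = Tuple[str, str]  # (text, type)


def f_measure(predicted: Set[Entity], gold: Set[Entity], beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F1 when beta is 1)."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


if __name__ == "__main__":
    gold = {("Jim", "PERSON"), ("300", "QUANTITY"),
            ("Acme Corp.", "ORGANIZATION"), ("2006", "DATE")}
    # One entity is missed and one spurious entity is predicted.
    predicted = {("Jim", "PERSON"), ("Acme Corp.", "ORGANIZATION"),
                 ("2006", "DATE"), ("shares", "ORGANIZATION")}
    print(f_measure(predicted, gold))
```

In the sample data three of four predictions are correct and three of four reference entities are found, so precision, recall and the F-measure all equal 0.75.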
Natural language
In the [[philosophy of language]], a '''natural language''' (or '''ordinary language''') is a [[language]] that is spoken, [[writing|written]], or [[Sign language|signed]] by [[human]]s for general-purpose communication, as distinguished from [[formal language]]s (such as [[Programming language|computer-programming languages]] or the "languages" used in the study of formal [[logic]], especially [[mathematical logic]]) and from [[constructed language]]s.
== Defining natural language ==
Though the exact definition is debatable, natural language is often contrasted with artificial or [[constructed languages]] such as [[Esperanto]], [[Latino sine Flexione]], and [[Occidental language|Occidental]].
Linguists have an incomplete understanding of the rules underlying natural languages, and these rules are therefore objects of study. The understanding of natural languages reveals much about not only how language works (in terms of [[syntax]], [[semantics]], [[phonetics]], [[phonology]], etc.), but also about how the human [[mind]] and the human [[brain]] process language. In linguistic terms, 'natural language' only applies to a language that has evolved naturally, and the study of natural language primarily involves native (first language) speakers.
The theory of [[universal grammar]] proposes that all natural languages have certain underlying rules which constrain the structure of the specific grammar for any given language.
While [[grammarians]], writers of dictionaries, and language policy-makers all have a certain influence on the evolution of language, their ability to influence what people think they 'ought' to say is distinct from what people actually say. Natural language applies to the latter, and is thus a 'descriptive' rather than a 'prescriptive' term. Thus non-standard language varieties (such as [[African American Vernacular English]]) are considered to be natural while standard language varieties (such as [[Standard American English]]) which are more 'prescripted' can be considered to be at least somewhat artificial or constructed.
== Native language learning ==
The [[learning]] of one's own [[native language]], typically that of one's [[parent]]s, normally occurs spontaneously in early human [[childhood]] and is [[Biology|biologically]] driven. A crucial role in this process is played by the [[Nervous system|neural]] activity of a portion of the human [[brain]] known as [[Broca's area]].
There are approximately 7,000 current human languages, and many, if not most, seem to share certain properties, leading to the belief in the existence of [[Universal Grammar]], as shown by [[generative grammar]] studies pioneered by the work of [[Noam Chomsky]]. Recently, it has been demonstrated that a dedicated network in the human brain (crucially involving [[Broca's area]], a portion of the left inferior frontal gyrus) is selectively activated by complex verbal structures (but not simple ones) of those languages that meet the Universal Grammar requirements.
== Origins of natural language ==
There is disagreement among anthropologists on when language was first used by humans (or their ancestors). Estimates range from about two million (2,000,000) years ago, during the time of ''[[Homo habilis]]'', to as recently as forty thousand (40,000) years ago, during the time of [[Cro-Magnon]] man. However, recent evidence suggests modern human language was invented or evolved in Africa prior to the dispersal of humans from Africa around 50,000 years ago. Since all people, including the most isolated indigenous groups such as the [[Andamanese]] or the [[Tasmanian aboriginals]], possess language, it must have been present in the ancestral populations in Africa before the human population split into various groups to colonize the rest of the world.
Some claim that all natural languages came out of one single language, known as [[Adamic]].
== Linguistic diversity ==
As of early 2007, there are 6,912 known living human languages. A "living language" is simply one which is in wide use by a specific group of living people. Estimates of the exact number of known living languages vary from 5,000 to 10,000, depending generally on the precision of one's definition of "language", and in particular on how one classifies [[dialects]]. There are also many dead or [[extinct language]]s.
There is no [[dialect#.22Dialect.22 or .22language.22|clear distinction]] between a language and a [[dialect]], notwithstanding linguist [[Max Weinreich]]'s famous [[aphorism]] that "[[a language is a dialect with an army and navy]]." In other words, the distinction may hinge on political considerations as much as on cultural differences, distinctive [[writing system]]s, or degree of [[mutual intelligibility]].
It is probably impossible to enumerate the living languages accurately, because our worldwide knowledge is incomplete and the count is a "moving target", as explained in greater detail in the [[Ethnologue]]'s Introduction, pp. 7–8. In the 15th edition, the 103 newly added languages are not new but reclassified due to refinements in the definition of language.
Although widely considered an [[encyclopedia]], the [[Ethnologue]] actually presents itself as an incomplete catalog, including only named languages that its editors are able to document. With each edition, the number of catalogued languages has grown.
Beginning with the 14th edition (2000), an attempt was made to include all known living languages. SIL used an internal 3-letter code fashioned after [[airport code]]s to identify languages. This was the precursor to the modern [[ISO 639-3]] standard, to which SIL contributed. The standard allows for over 14,000 languages. In turn, the 15th edition was revised to conform to the pending ISO 639-3 standard.
Of the catalogued languages, 497 have been flagged as "nearly extinct" due to trends in their usage.
Per the 15th edition, 6,912 living languages are shared by over 5.7 billion speakers. (p. 15)
== Taxonomy ==
The [[Taxonomic classification|classification]] of natural languages can be performed on the basis of different underlying principles (different closeness notions, respecting different properties and relations between languages); important directions of present classifications are:
* paying attention to the historical evolution of languages results in a genetic classification of languages—which is based on genetic relatedness of languages,
* paying attention to the internal structure of languages ([[grammar]]) results in a typological classification of languages—which is based on similarity of one or more components of the language's grammar across languages,
* and respecting geographical closeness and contacts between language-speaking communities results in areal groupings of languages.
The different classifications do not match each other and are not expected to, but the correlation between them is an important point for many [[linguistics|linguistic]] research works. (There is a parallel to the classification of [[species]] in biological [[phylogenetics]] here: consider [[monophyletic]] vs. [[polyphyletic]] groups of species.)
The task of genetic classification belongs to the field of [[historical-comparative linguistics]]; that of typological classification belongs to [[linguistic typology]].
See also [[Taxonomy]], and [[Taxonomic classification]] for the general idea of classification and taxonomies.
==== Genetic classification ====
The world's languages have been grouped into families of languages that are believed to have common ancestors. Some of the major families are the [[Indo-European languages]], the [[Afro-Asiatic languages]], the [[Austronesian languages]], and the [[Sino-Tibetan languages]].
The shared features of languages from one family can be due to shared ancestry. (Compare with [[homology (biology)|homology]] in biology.)
==== Typological classification ====
An example of a typological classification is the classification of languages on the basis of the basic order of the [[verb]], the [[subject (grammar)|subject]] and the [[object (grammar)|object]] in a [[sentence (linguistics)|sentence]] into several types: [[SVO language|SVO]], [[SOV language|SOV]], [[VSO language|VSO]], and so on. ([[English language|English]], for instance, belongs to the [[SVO language]] type.)
The shared features of languages of one type (that is, from one typological class) may have arisen completely independently. (Compare with [[analogy (biology)|analogy]] in biology.) Their co-occurrence might be due to the universal laws governing the structure of natural languages, known as [[language universal]]s.
==== Areal classification ====
The following language groupings can serve as some linguistically significant examples of areal linguistic units, or ''[[sprachbund]]s'': [[Balkan linguistic union]], or the bigger group of [[European languages]]; [[Caucasian languages]]; [[East Asian languages]]. Although the members of each group are not closely [[genetic relatedness of languages|genetically related]], there is a reason for them to share similar features: their speakers have been in contact for a long time within a common community, and the languages ''converged'' in the course of history. The shared features are called "[[areal feature (linguistics)|areal feature]]s".
One should be careful about the underlying classification principle for groups of languages which have apparently a geographical name: besides areal linguistic units, the [[taxa]] of the genetic classification ([[language family|language families]]) are often given names which themselves or parts of which refer to geographical areas.
== Controlled languages ==
Controlled natural languages are subsets of natural languages whose grammars and dictionaries have been restricted in order to reduce or eliminate both ambiguity and complexity. The purpose behind the development and implementation of a controlled natural language typically is to aid non-native speakers of a natural language in understanding it, or to ease computer processing of a natural language. An example of a widely used controlled natural language is [[Simplified English]], which was originally developed for [[aerospace]] industry maintenance manuals.
== Constructed languages and international auxiliary languages ==
Constructed [[international auxiliary language]]s such as [[Esperanto]] and [[Interlingua]] that have [[native speaker]]s are considered natural languages by some. However, constructed languages, while they are clearly languages, are not generally considered natural languages. The distinction is that natural languages have evolved through use in communication, while Esperanto was selectively designed by [[L.L. Zamenhof]] from natural languages and did not grow out of natural fluctuations in vocabulary and syntax. Nor has Esperanto been naturally "standardized" by children's natural tendency to correct for illogical grammar structures in their parents' language, which can be seen in the development of [[pidgin]] languages into [[creole language]]s (as explained by Steven Pinker in [[The Language Instinct]]). A possible exception is the case of true native speakers of such languages. A more substantive basis for calling Interlingua natural is that its vocabulary, grammar, and orthography are natural; they have been standardized and presented by a [[International Auxiliary Language Association|linguistic research body]], but they predated it and are not themselves considered a product of human invention. Most experts, however, consider Interlingua to be naturalistic rather than natural. [[Latino Sine Flexione]], a second naturalistic auxiliary language, is also naturalistic in content but is no longer widely spoken.
==Natural Language Processing==
Natural language processing (NLP) is a subfield of artificial intelligence and computational linguistics. It studies the problems of automated generation and understanding of natural human languages.
Natural-language-generation systems convert information from computer databases into normal-sounding human language. Natural-language-understanding systems convert samples of human language into more formal representations that are easier for computer programs to manipulate.
== Modalities ==
Natural language manifests itself in modalities other than speech.
=== Sign languages ===
In linguistic terms, sign languages are as rich and complex as any oral language, despite the previously common misconception that they are not "real languages". Professional linguists have studied many sign languages and found them to have every linguistic component required to be classed as true natural languages.
Sign languages are not [[pantomime]], much as most spoken language is not [[onomatopoeic]]. The signs do tend to exploit iconicity (visual connections with their referents) more than what is common in spoken language, but they are above all conventional and hence generally incomprehensible to non-speakers, just like spoken words and morphemes. They are not a visual rendition of an oral language either. They have complex grammars of their own, and can be used to discuss any topic, from the simple and concrete to the lofty and abstract.
=== Written languages ===
In a sense, written language should be distinguished from natural language. Until recently in the developed world, it was common for many people to be fluent in [[spoken language|spoken]] or [[sign language|signed languages]] and yet remain illiterate; this is still the case in poor countries today. Furthermore, natural [[language acquisition]] during childhood is largely spontaneous, while [[literacy]] must usually be intentionally acquired.
Natural language processing
'''Natural language processing''' ('''NLP''') is a subfield of [[artificial intelligence]] and [[computational linguistics]]. It studies the problems of automated generation and understanding of [[natural language|natural human languages]].
Natural-language-generation systems convert information from computer databases into normal-sounding human language. Natural-language-understanding systems convert samples of human language into more formal representations that are easier for [[computer]] programs to manipulate.
==Tasks and limitations==
In theory, natural-language processing is a very attractive method of [[human-computer interaction]]. Early systems such as [[SHRDLU]], working in restricted "[[blocks world]]s" with restricted vocabularies, worked extremely well, leading researchers to excessive optimism, which was soon lost when the systems were extended to more realistic situations with real-world [[ambiguity]] and [[complexity]].
Natural-language understanding is sometimes referred to as an [[AI-complete]] problem, because natural-language recognition seems to require extensive knowledge about the outside world and the ability to manipulate it. The definition of "[[understanding]]" is one of the major problems in natural-language processing.
==Concrete problems==
Some examples of the problems faced by natural-language-understanding systems:
* The sentences ''We gave the monkeys the bananas because they were hungry'' and ''We gave the monkeys the bananas because they were over-ripe'' have the same surface grammatical structure. However, the pronoun ''they'' refers to ''monkeys'' in one sentence and ''bananas'' in the other, and it is impossible to tell which without a knowledge of the properties of monkeys and bananas.
* A string of words may be interpreted in different ways. For example, the string ''Time flies like an arrow'' may be interpreted in a variety of ways:
**The common [[simile]]: ''[[time]]'' moves quickly just like an arrow does;
**measure the speed of flies like you would measure that of an arrow (thus interpreted as an imperative) - i.e. ''(You should) time flies as you would (time) an arrow.'';
**measure the speed of flies like an arrow would - i.e. ''Time flies in the same way that an arrow would (time them).'';
**measure the speed of flies that are like arrows - i.e. ''Time those flies that are like arrows'';
**all of a type of flying insect, "time-flies," collectively enjoys a single arrow (compare ''Fruit flies like a banana'');
**each of a type of flying insect, "time-flies," individually enjoys a different arrow (similar comparison applies);
**A concrete object, for example the magazine, ''[[Time (magazine)|Time]]'', travels through the air in an arrow-like manner.
English is particularly challenging in this regard because it has little [[inflectional morphology]] to distinguish between [[parts of speech]].
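The ambiguity of ''Time flies like an arrow'' can be made concrete by counting parse trees. The following is a minimal sketch in Python, assuming a hypothetical toy grammar and lexicon (the names <code>LEXICON</code>, <code>RULES</code> and <code>count_parses</code> are illustrative and not taken from any particular NLP system). It fills a CYK-style chart bottom-up and covers only some of the readings listed above.

```python
from collections import defaultdict

# Hypothetical toy lexicon: each word may belong to several categories,
# which is exactly what makes the sentence ambiguous.
LEXICON = {
    "time": ["N", "NP", "V"],
    "flies": ["N", "NP", "V", "VP"],
    "like": ["V", "P"],
    "an": ["Det"],
    "arrow": ["N"],
}

# Hypothetical binary grammar rules: (parent, left child, right child).
RULES = [
    ("S", "NP", "VP"),
    ("NP", "Det", "N"),
    ("NP", "N", "N"),
    ("NP", "NP", "PP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),
    ("PP", "P", "NP"),
]


def count_parses(words):
    """Return the number of parse trees per category for the whole string."""
    n = len(words)
    # chart[(i, j)][category] = number of trees of that category over words[i:j]
    chart = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(words):
        for category in LEXICON[word]:
            chart[(i, i + 1)][category] += 1
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):  # every split point of the span
                for parent, left, right in RULES:
                    chart[(i, j)][parent] += chart[(i, k)][left] * chart[(k, j)][right]
    return {cat: count for cat, count in chart[(0, n)].items() if count}


PARSES = count_parses("time flies like an arrow".split())
```

Under these assumptions the chart contains two sentence-level analyses (category <code>S</code>) and two imperative analyses (category <code>VP</code>) for the same five words; a realistic grammar would yield many more.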
* English and several other languages don't specify which word an adjective applies to. For example, the string "pretty little girls' school" leaves several questions open:
** Does the school look little?
** Do the girls look little?
** Do the girls look pretty?
** Does the school look pretty?
* We often imply additional information in spoken language by the way we place stress on words. The sentence "I never said she stole my money" demonstrates the importance stress can play in a sentence, and thus the inherent difficulty a natural language processor can have in parsing it. Depending on which word the speaker stresses, this sentence could have several distinct meanings:
** "'''I''' never said she stole my money" - Someone else said it, but ''I'' didn't.
** "I '''never''' said she stole my money" - I simply didn't ever say it.
** "I never '''said''' she stole my money" - I might have implied it in some way, but I never explicitly said it.
** "I never said '''she''' stole my money" - I said someone took it; I didn't say it was she.
** "I never said she '''stole''' my money" - I just said she probably borrowed it.
** "I never said she stole '''my''' money" - I said she stole someone else's money.
** "I never said she stole my '''money'''" - I said she stole something, but not my money.
==Subproblems==
; [[Speech segmentation]]: In most spoken languages, the sounds representing successive letters blend into each other, so the conversion of the analog signal to discrete characters can be a very difficult process. Also, in [[natural speech]] there are hardly any pauses between successive words; the location of those boundaries usually must take into account [[grammatical]] and [[semantic]] constraints, as well as the [[context]].
; [[Text segmentation]]: Some written languages like [[Chinese language|Chinese]], [[Japanese language|Japanese]] and [[Thai language|Thai]] do not have single-word boundaries either, so any significant text [[parsing]] usually requires the identification of word boundaries, which is often a non-trivial task.
; [[Word sense disambiguation]]: Many words have more than one [[meaning]]; we have to select the meaning which makes the most sense in context.
; [[Syntactic ambiguity]]: The [[grammar]] for [[natural language]]s is [[ambiguous]], i.e. there are often multiple possible [[parse tree]]s for a given sentence. Choosing the most appropriate one usually requires [[semantics|semantic]] and contextual information. Specific problem components of syntactic ambiguity include [[sentence boundary disambiguation]].
; Imperfect or irregular input : Foreign or regional accents and vocal impediments in speech; typing or grammatical errors, [[Optical character recognition|OCR]] errors in texts.
; [[Speech acts]] and plans: A sentence can often be considered an action by the speaker. The sentence structure, alone, may not contain enough information to define this action. For instance, a question is actually the speaker requesting some sort of response from the listener. The desired response may be verbal, physical, or some combination. For example, "Can you pass the class?" is a request for a simple yes-or-no answer, while "Can you pass the salt?" is requesting a physical action to be performed. It is not appropriate to respond with "Yes, I can pass the salt," without the accompanying action (although "No" or "I can't reach the salt" would explain a lack of action).
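Why identifying word boundaries is non-trivial can be illustrated with a simple baseline. The sketch below, in Python, uses greedy forward maximum matching against a word list; this is one common baseline and not a method prescribed by the text, and the vocabulary and the function name <code>max_match</code> are hypothetical. An unspaced English string stands in for a language written without word boundaries.

```python
def max_match(text, vocabulary):
    """Greedy forward maximum matching: always take the longest known word."""
    longest = max(len(word) for word in vocabulary)
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(longest, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocabulary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy vocabulary.
VOCABULARY = {"the", "theta", "table", "bled", "down", "own", "there"}
SEGMENTS = max_match("thetabledownthere", VOCABULARY)
```

With this vocabulary the greedy strategy yields "theta bled own there" instead of the intended "the table down there": every piece is a valid word, so choosing the right segmentation requires grammatical and semantic context, as noted above.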
== Statistical NLP ==
Statistical natural-language processing uses [[stochastic]], [[probabilistic]] and [[statistical]] methods to resolve some of the difficulties discussed above, especially those which arise because longer sentences are highly ambiguous when processed with realistic grammars, yielding thousands or millions of possible analyses. Methods for disambiguation often involve the use of [[corpus linguistics|corpora]] and [[Markov model]]s. Statistical NLP comprises all quantitative approaches to automated language processing, including probabilistic modeling, [[information theory]], and [[linear algebra]]. The technology for statistical NLP comes mainly from [[machine learning]] and [[data mining]], both of which are fields of [[artificial intelligence]] that involve learning from data.
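As an illustration of how a corpus and a Markov model can rank competing analyses, the following Python sketch estimates a first-order (bigram) Markov model with add-one smoothing and uses it to choose between two candidate word sequences. The three-sentence corpus, the candidates and the function names are hypothetical; real systems are trained on far larger corpora and usually use more refined smoothing.

```python
import math
from collections import Counter


def train_bigram_model(sentences):
    """Count contexts and bigrams; <s> and </s> mark sentence boundaries."""
    contexts, bigrams, vocabulary = Counter(), Counter(), {"</s>"}
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocabulary.update(tokens[1:])
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams, len(vocabulary)


def log_probability(sentence, model):
    """Log-probability of a sentence under the add-one-smoothed bigram model."""
    contexts, bigrams, vocab_size = model
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return sum(
        math.log((bigrams[(prev, word)] + 1) / (contexts[prev] + vocab_size))
        for prev, word in zip(tokens, tokens[1:])
    )


# Hypothetical toy corpus and candidate analyses.
CORPUS = [
    "the monkeys ate the bananas",
    "the monkeys were hungry",
    "the bananas were ripe",
]
MODEL = train_bigram_model(CORPUS)
CANDIDATES = ["the monkeys were hungry", "the hungry were monkeys"]
BEST = max(CANDIDATES, key=lambda s: log_probability(s, MODEL))
```

The model prefers the candidate whose word pairs were observed in the corpus, without any explicit grammar.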
==Major tasks in NLP==
* [[Automatic summarization]]
* [[Foreign language reading aid]]
* [[Foreign language writing aid]]
* [[Information extraction]]
* [[Information retrieval]]
* [[Machine translation]]
* [[Named entity recognition]]
* [[Natural language generation]]
* [[Natural language understanding]]
* [[Optical character recognition]]
* [[Question answering]]
* [[Speech recognition]]
* [[Spoken dialogue system]]
* [[Text simplification]]
* [[Text to speech]]
* [[Text-proofing]]
== Evaluation of natural language processing ==
===Objectives===
The goal of NLP evaluation is to measure one or more ''qualities'' of an algorithm or a system, in order to determine if (or to what extent) the system answers the goals of its designers, or the needs of its users. Research in NLP evaluation has received considerable attention, because the definition of proper evaluation criteria is one way to specify precisely an NLP problem, going thus beyond the vagueness of tasks defined only as ''language understanding'' or ''language generation''. A precise set of evaluation criteria, which includes mainly evaluation data and evaluation metrics, enables several teams to compare their solutions to a given NLP problem.
===Short history of evaluation in NLP===
The first evaluation campaign on written texts seems to have been a campaign dedicated to message understanding in 1987 (Pallet 1998). Then, the Parseval/GEIG project compared phrase-structure grammars (Black 1991). A series of campaigns within the Tipster project were run on tasks such as summarization, translation and searching (Hirshman 1998). In 1994, in Germany, the Morpholympics compared German taggers. Then, the Senseval and Romanseval campaigns were conducted with the objective of semantic disambiguation. In 1996, the Sparkle campaign compared syntactic parsers in four different languages (English, French, German and Italian). In France, the Grace project compared a set of 21 taggers for French in 1997 (Adda 1999). In 2004, during the [[Technolangue/Easy]] project, 13 parsers for French were compared. Large-scale evaluation of dependency parsers was performed in the context of the CoNLL shared tasks in 2006 and 2007. In Italy, the Evalita campaign was conducted in 2007 to compare various tools for Italian ([http://evalita.itc.it Evalita web site]). In France, within the ANR-Passage project (end of 2007), 10 parsers for French were compared ([http://atoll.inria.fr/passage/ Passage web site]).
* Adda G., Mariani J., Paroubek P., Rajman M. 1999. L'action GRACE d'évaluation de l'assignation des parties du discours pour le français. Langues vol-2.
* Black E., Abney S., Flickinger D., Gdaniec C., Grishman R., Harrison P., Hindle D., Ingria R., Jelinek F., Klavans J., Liberman M., Marcus M., Roukos S., Santorini B., Strzalkowski T. 1991. A procedure for quantitatively comparing the syntactic coverage of English grammars. DARPA Speech and Natural Language Workshop.
* Hirshman L. 1998. Language understanding evaluation: lessons learned from MUC and ATIS. LREC Granada.
* Pallet D.S. 1998. The NIST role in automatic speech recognition benchmark tests. LREC Granada.
===Different types of evaluation===
Depending on the evaluation procedures, a number of distinctions are traditionally made in NLP evaluation.
* Intrinsic vs. extrinsic evaluation
Intrinsic evaluation considers an isolated NLP system and characterizes its performance mainly with respect to a ''gold standard'' result, pre-defined by the evaluators. Extrinsic evaluation, also called ''evaluation in use'' considers the NLP system in a more complex setting, either as an embedded system or serving a precise function for a human user. The extrinsic performance of the system is then characterized in terms of its utility with respect to the overall task of the complex system or the human user.
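A minimal sketch of an intrinsic evaluation in Python is given below. It compares a system's output with a gold standard using precision, recall and F1, which are commonly used metrics but only one possible choice; the annotation tuples and the function name <code>precision_recall_f1</code> are hypothetical.

```python
def precision_recall_f1(predicted, gold):
    """Compare a set of predicted annotations with a gold-standard set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations: (start token, end token, label).
GOLD = {(0, 2, "PER"), (5, 6, "LOC"), (9, 11, "ORG")}
PREDICTED = {(0, 2, "PER"), (5, 6, "ORG"), (9, 11, "ORG"), (12, 13, "LOC")}
PRECISION, RECALL, F1 = precision_recall_f1(PREDICTED, GOLD)
```

In this example two of the four predicted annotations match the gold standard, giving a precision of 0.5 and a recall of 2/3.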
* Black-box vs. glass-box evaluation
Black-box evaluation requires one to run an NLP system on a given data set and to measure a number of parameters related to the quality of the process (speed, reliability, resource consumption) and, most importantly, to the quality of the result (e.g. the accuracy of data annotation or the fidelity of a translation). Glass-box evaluation looks at the design of the system, the algorithms that are implemented, the linguistic resources it uses (e.g. vocabulary size), etc. Given the complexity of NLP problems, it is often difficult to predict performance only on the basis of glass-box evaluation, but this type of evaluation is more informative with respect to error analysis or future developments of a system.
* Automatic vs. manual evaluation
In many cases, automatic procedures can be defined to evaluate an NLP system by comparing its output with the gold standard (or desired) one. Although the cost of producing the gold standard can be quite high, automatic evaluation can be repeated as often as needed without much additional cost (on the same input data). However, for many NLP problems, the definition of a gold standard is a complex task, and can prove impossible when inter-annotator agreement is insufficient. Manual evaluation is performed by human judges, who are instructed to estimate the quality of a system, or most often of a sample of its output, based on a number of criteria. Although, thanks to their linguistic competence, human judges can be considered as the reference for a number of language processing tasks, there is also considerable variation across their ratings. This is why automatic evaluation is sometimes referred to as ''objective'' evaluation, while the human kind appears to be more ''subjective.''
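Inter-annotator agreement is often quantified with a chance-corrected statistic. The Python sketch below computes Cohen's kappa for two annotators; the choice of this statistic, the part-of-speech labels and the function name <code>cohen_kappa</code> are assumptions made for illustration, and other agreement measures exist.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items labelled identically.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both annotators labelled at random
    # according to their own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical part-of-speech judgements by two annotators.
ANNOTATOR_A = ["N", "N", "V", "V", "N", "V", "N", "N"]
ANNOTATOR_B = ["N", "V", "V", "V", "N", "V", "N", "N"]
KAPPA = cohen_kappa(ANNOTATOR_A, ANNOTATOR_B)
```

Here the annotators agree on seven of eight items (0.875), chance agreement is 0.5, and kappa is therefore 0.75.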
=== Shared tasks (Campaigns)===
* [[BioCreative]]
* [[Message Understanding Conference]]
* [[Technolangue/Easy]]
* [[Text Retrieval Conference]]
==Standardization in NLP==
An ISO sub-committee is working to ease interoperability between [[lexical resource]]s and NLP programs. The sub-committee is part of [[ISO/TC37]] and is called ISO/TC37/SC4. Some ISO standards are already published, but most of them are still under development, mainly on lexicon representation (see [[lexical markup framework|LMF]]), annotation and data category registry.
Neural network
Traditionally, the term '''neural network''' had been used to refer to a network or circuit of [[neuron|biological neurons]]. The modern usage of the term often refers to [[artificial neural network]]s, which are composed of [[artificial neuron]]s or nodes. Thus the term has two distinct usages:
# [[Biological neural network]]s are made up of real biological neurons that are connected or functionally-related in the [[peripheral nervous system]] or the [[central nervous system]]. In the field of [[neuroscience]], they are often identified as groups of neurons that perform a specific physiological function in laboratory analysis.
# [[Artificial neural network]]s are made up of interconnecting artificial neurons (programming constructs that mimic the properties of biological neurons). Artificial neural networks may either be used to gain an understanding of biological neural networks, or for solving artificial intelligence problems without necessarily creating a model of a real biological system.
This article focuses on the relationship between the two concepts; for detailed coverage of the two different concepts refer to the separate articles: [[Biological neural network]] and [[Artificial neural network]].
==Characterization==
In general a biological neural network is composed of a group or groups of chemically connected or functionally associated neurons. A single neuron may be connected to many other neurons and the total number of neurons and connections in a network may be extensive. Connections, called [[synapses]], are usually formed from [[axons]] to [[dendrites]], though dendrodendritic microcircuits and other connections are possible. Apart from the electrical signaling, there are other forms of signaling that arise from [[neurotransmitter]] diffusion, which have an effect on electrical signaling. As such, neural networks are extremely complex. [[Artificial intelligence]] and [[cognitive modeling]] try to simulate some properties of neural networks. While similar in their techniques, the former has the aim of solving particular tasks, while the latter aims to build mathematical models of biological neural systems.
In the [[artificial intelligence]] field, artificial neural networks have been applied successfully to [[speech recognition]], [[image analysis]] and adaptive [[control]], in order to construct [[software agents]] (in [[Video game|computer and video games]]) or [[autonomous robot]]s. Most of the currently employed artificial neural networks for artificial intelligence are based on [[statistical estimation]], [[Optimization (mathematics)|optimization]] and [[control theory]].
The [[cognitive modelling]] field involves the physical or mathematical modeling of the behaviour of neural systems; ranging from the individual neural level (e.g. modelling the spike response curves of neurons to a stimulus), through the neural cluster level (e.g. modelling the release and effects of dopamine in the basal ganglia) to the complete organism (e.g. behavioural modelling of the organism's response to stimuli).
==The brain, neural networks and computers==
Neural networks, as used in artificial intelligence, have traditionally been viewed as simplified models of neural processing in the brain, even though the relation between this model and the brain's biological architecture is debated.
A subject of current research in theoretical neuroscience is the question surrounding the degree of complexity and the properties that individual neural elements should have to reproduce something resembling animal intelligence.
Historically, computers evolved from the [[von Neumann architecture]], which is based on sequential processing and execution of explicit instructions. On the other hand, the origins of neural networks are based on efforts to model information processing in biological systems, which may rely largely on parallel processing as well as implicit instructions based on recognition of patterns of 'sensory' input from external sources. In other words, at its very heart a neural network is a complex statistical processor (as opposed to being tasked to sequentially process and execute).
==Neural networks and artificial intelligence==
An ''artificial neural network'' (ANN), also called a ''simulated neural network'' (SNN) or commonly just ''neural network'' (NN), is an interconnected group of [[artificial neuron]]s that uses a [[mathematical model|mathematical or computational model]] for [[information processing]] based on a [[connectionism|connectionistic]] approach to [[computation]]. In most cases an ANN is an [[adaptive system]] that changes its structure based on external or internal information that flows through the network.
In more practical terms neural networks are [[non-linear]] [[statistical]] [[data modeling]] or [[decision making]] tools. They can be used to model complex relationships between inputs and outputs or to [[Pattern recognition|find patterns]] in data.
===Background===
An [[artificial neural network]] involves a network of simple processing elements ([[artificial neurons]]) which can exhibit complex global behaviour, determined by the connections between the processing elements and element parameters. One classical type of artificial neural network is the [[Hopfield net]].
In a neural network model simple [[Node (neural networks)|nodes]], which can be called variously "neurons", "neurodes", "Processing Elements" (PE) or "units", are connected together to form a network of nodes — hence the term "neural network". While a neural network does not have to be adaptive ''per se'', its practical use comes with algorithms designed to alter the strength (weights) of the connections in the network to produce a desired signal flow.
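How an algorithm can alter connection weights to produce a desired output can be sketched with a single threshold unit. The Python example below uses the classic perceptron learning rule, chosen here purely as an illustration; the function names and the training examples (the logical AND function) are hypothetical.

```python
def predict(weights, bias, inputs):
    """A single unit: weighted sum of the inputs followed by a threshold."""
    activation = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if activation > 0 else 0


def train_perceptron(examples, epochs=20, rate=1):
    """Adjust the weights whenever the unit's output differs from the target."""
    weights, bias = [0] * len(examples[0][0]), 0
    for _ in range(epochs):
        for inputs, target in examples:
            error = target - predict(weights, bias, inputs)
            weights = [w + rate * error * x for w, x in zip(weights, inputs)]
            bias += rate * error
    return weights, bias


# Hypothetical training set: the logical AND of two binary inputs.
AND_EXAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
WEIGHTS, BIAS = train_perceptron(AND_EXAMPLES)
OUTPUTS = [predict(WEIGHTS, BIAS, inputs) for inputs, _ in AND_EXAMPLES]
```

After training, the unit reproduces the target outputs for all four input patterns. A single unit of this kind can only learn linearly separable functions, which is one reason units are connected into networks.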
In modern [[Neural network software|software implementations]] of artificial neural networks the approach inspired by biology has more or less been abandoned for a more practical approach based on statistics and signal processing. In some of these systems neural networks, or parts of neural networks (such as [[artificial neuron]]s) are used as components in larger systems that combine both adaptive and non-adaptive elements.
The concept of a neural network appears to have first been proposed by [[Alan Turing]] in his 1948 paper "Intelligent Machinery".
===Applications===
The utility of artificial neural network models lies in the fact that they can be used to infer a function from observations and also to use it. This is particularly useful in applications where the complexity of the data or task makes the design of such a function by hand impractical.
====Real life applications====
The tasks to which artificial neural networks are applied tend to fall within the following broad categories:
*[[Function approximation]], or [[regression analysis]], including [[time series prediction]] and modelling.
*[[Statistical classification|Classification]], including [[Pattern recognition|pattern]] and sequence recognition, novelty detection and sequential decision making.
*[[Data processing]], including filtering, clustering, [[blind signal separation]] and compression.
Application areas include system identification and control (vehicle control, process control), game-playing and decision making (backgammon, chess, racing), pattern recognition (radar systems, face identification, object recognition, etc.), sequence recognition (gesture, speech, handwritten text recognition), medical diagnosis, financial applications, [[data mining]] (or knowledge discovery in databases, "KDD"), visualization and [[e-mail spam]] filtering.
===Neural network software===
''Main article:'' [[Neural network software]]
'''Neural network software''' is used to [[Simulation|simulate]], [[research]], [[Software development|develop]] and apply [[artificial neural network]]s, [[biological neural network]]s and in some cases a wider array of [[adaptive system]]s.
====Learning paradigms====
There are three major learning paradigms, each corresponding to a particular abstract learning task. These are [[supervised learning]], [[unsupervised learning]] and [[reinforcement learning]]. Usually any given type of network architecture can be employed in any of those tasks.
;Supervised learning
In [[supervised learning]], we are given a set of example pairs <math>(x, y), x \in X, y \in Y</math> and the aim is to find a function <math>f : X \rightarrow Y</math> in the allowed class of functions that matches the examples. In other words, we wish to ''infer'' the mapping implied by the data; the cost function is related to the mismatch between our mapping and the data.
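The idea of choosing a function that minimizes the mismatch with the example pairs can be sketched as follows. This Python example assumes the simplest possible class of functions, lines of the form w·x + b, a mean squared error cost, and plain gradient descent; the data and the name <code>fit_line</code> are hypothetical, and a neural network would replace the line with a richer function class.

```python
def fit_line(pairs, rate=0.05, steps=2000):
    """Fit f(x) = w * x + b to (input, target) pairs by gradient descent."""
    w, b = 0.0, 0.0
    n = len(pairs)
    for _ in range(steps):
        # Mismatch between the current mapping and each target.
        errors = [(w * x + b) - y for x, y in pairs]
        # Gradient of the mean squared error with respect to w and b.
        grad_w = 2.0 * sum(e * x for e, (x, _y) in zip(errors, pairs)) / n
        grad_b = 2.0 * sum(errors) / n
        w -= rate * grad_w
        b -= rate * grad_b
    return w, b


# Hypothetical example pairs generated by the mapping y = 2x + 1.
PAIRS = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]
W, B = fit_line(PAIRS)
```

On this noise-free data the procedure recovers a slope close to 2 and an intercept close to 1.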
;Unsupervised learning
In [[unsupervised learning]] we are given some data <math>x</math>, and a cost function to be minimized, which can be any function of <math>x</math> and the network's output, <math>f</math>. The cost function is determined by the task formulation. Most applications fall within the domain of [[estimation problems]] such as [[statistical modeling]], [[Data compression|compression]], [[Mail filter|filtering]], [[blind source separation]] and [[data clustering|clustering]].
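Clustering gives a simple instance of this formulation. In the Python sketch below, which uses the k-means procedure rather than a neural network, the output for a data point is its nearest cluster centre and the cost is the summed squared distance between each point and that centre. The data, the starting centres and the function names are hypothetical.

```python
def clustering_cost(data, centroids):
    """Cost: squared distance from each point to its nearest centroid."""
    return sum(min((x - c) ** 2 for c in centroids) for x in data)


def k_means(data, centroids, iterations=10):
    """Alternate between assigning points and moving centroids (1-D data)."""
    centroids = list(centroids)
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for x in data:
            nearest = min(range(len(centroids)), key=lambda i: (x - centroids[i]) ** 2)
            clusters[nearest].append(x)
        # Move each centroid to the mean of its cluster; keep it if empty.
        centroids = [
            sum(cluster) / len(cluster) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids


# Hypothetical one-dimensional data with two obvious groups.
DATA = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
START = [1.0, 1.2]
CENTROIDS = k_means(DATA, START)
```

No target outputs are supplied; the structure in the data alone drives the cost down as the centres move towards the two groups.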
;Reinforcement learning
In [[reinforcement learning]], data is usually not given, but generated by an agent's interactions with the environment. At each point in time <math>t</math>, the agent performs an action and the environment generates an observation and an instantaneous cost, according to some (usually unknown) dynamics. The aim is to discover a ''policy'' for selecting actions that minimises some measure of a long-term cost, i.e. the expected cumulative cost. The environment's dynamics and the long-term cost for each policy are usually unknown, but can be estimated. ANNs are frequently used in reinforcement learning as part of the overall algorithm. Tasks that fall within the paradigm of reinforcement learning are [[control]] problems, [[game]]s and other [[sequential decision making]] tasks.
====Learning algorithms====
There are many algorithms for training neural networks; most of them can be viewed as a straightforward application of [[Optimization (mathematics)|optimization]] theory and [[statistical estimation]]. [[Evolutionary computation]] methods, [[simulated annealing]], [[Expectation-Maximization|expectation maximization]] and [[non-parametric methods]] are among other commonly used methods for training neural networks. See also [[machine learning]].
Recent developments in this field have also seen [[particle swarm optimization]] and other [[swarm intelligence]] techniques used in the training of neural networks.
==Neural networks and neuroscience==
Theoretical and [[computational neuroscience]] is the field concerned with the theoretical analysis and computational modeling of biological neural systems.
Since neural systems are intimately related to cognitive processes and behaviour, the field is closely related to cognitive and behavioural modeling.
The aim of the field is to create models of biological neural systems in order to understand how biological systems work. To gain this understanding, neuroscientists strive to make a link between observed biological processes (data), biologically plausible mechanisms for neural processing and learning ([[biological neural network]] models) and theory (statistical learning theory and [[information theory]]).
=== Types of models ===
Many models are used in the field, each defined at a different level of abstraction and trying to model different aspects of neural systems. They range from models of the short-term behaviour of [[biological neuron models|individual neurons]], through models of how the dynamics of neural circuitry arise from interactions between individual neurons, to models of how behaviour can arise from abstract neural modules that represent complete subsystems. These include models of the long-term and short-term plasticity of neural systems and its relation to learning and memory, from the individual neuron to the system level.
===Current research===
While initially research had been concerned mostly with the electrical characteristics of neurons, a particularly important part of the investigation in recent years has been the exploration of the role of [[neuromodulators]] such as [[dopamine]], [[acetylcholine]], and [[serotonin]] on behaviour and learning. [[Biophysics|Biophysical]] models, such as [[BCM theory]], have been important in understanding mechanisms for [[synaptic plasticity]], and have had applications in both computer science and neuroscience. Research is ongoing in understanding the computational algorithms used in the brain, with some recent biological evidence for [[radial basis networks]] and [[neural backpropagation]] as mechanisms for processing data.
==History of the neural network analogy==
The concept of neural networks started in the late 1800s as an effort to describe how the human mind works. These ideas started being applied to computational models with the [[Perceptron]].
In the early 1950s, [[Friedrich Hayek]] was one of the first to posit the idea of [[spontaneous order]] in the brain arising out of decentralized networks of simple units (neurons). In the late 1940s, [[Donald Hebb]] made one of the first hypotheses for a mechanism of neural plasticity (i.e. learning), [[Hebbian learning]]. Hebbian learning is considered to be a 'typical' unsupervised learning rule and it (and variants of it) was an early model for [[long term potentiation]].
The [[Perceptron]] is essentially a linear classifier for classifying data <math>x \in \mathbb{R}^n</math> specified by parameters <math>w \in \mathbb{R}^n, b \in \mathbb{R}</math> and an output function <math>f = w'x + b</math>. Its parameters are adapted with an ad-hoc rule similar to stochastic steepest gradient descent. Because the [[inner product]] is a [[linear operator]] in the input space, the Perceptron can only perfectly classify a set of data for which different classes are [[linearly separable]] in the input space, while it often fails completely for non-separable data. While the development of the algorithm initially generated some enthusiasm, partly because of its apparent relation to biological mechanisms, the later discovery of this inadequacy caused such models to be abandoned until the introduction of non-linear models into the field.
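The limitation to linearly separable data can be illustrated with a minimal sketch of the classic perceptron update rule. The function names, the learning rate and the two toy data sets (logical AND and XOR) are assumptions chosen for illustration, not part of any historical implementation.

```python
def train_perceptron(samples, epochs=20, rate=1.0):
    """Fit f(x) = w.x + b with the perceptron rule.

    samples: list of (x, label) pairs; x is a tuple of numbers, label is +1 or -1.
    """
    n = len(samples[0][0])
    w = [0.0] * n
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, label in samples:
            f = sum(wi * xi for wi, xi in zip(w, x)) + b
            if label * f <= 0:  # misclassified, or exactly on the boundary
                # Move the decision boundary towards the misclassified point.
                w = [wi + rate * label * xi for wi, xi in zip(w, x)]
                b += rate * label
                mistakes += 1
        if mistakes == 0:  # every sample classified correctly: converged
            break
    return w, b


def predict(w, b, x):
    """Return the class (+1 or -1) assigned by the linear classifier."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1


# Hypothetical toy data: AND is linearly separable, XOR is not.
AND = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
XOR = [((0, 0), -1), ((0, 1), 1), ((1, 0), 1), ((1, 1), -1)]

w, b = train_perceptron(AND)
and_correct = all(predict(w, b, x) == y for x, y in AND)

w2, b2 = train_perceptron(XOR)
xor_correct = all(predict(w2, b2, x) == y for x, y in XOR)
```

In this sketch the rule converges on the AND data, whereas no choice of weights can classify all four XOR points, so training on XOR never reaches zero mistakes.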
The [[Cognitron]] (1975) was an early multilayered neural network with a training algorithm. The actual structure of the network and the methods used to set the interconnection weights change from one neural strategy to another, each with its advantages and disadvantages. Networks can propagate information in one direction only, or they can bounce back and forth until self-activation at a node occurs and the network settles on a final state. The ability for bi-directional flow of inputs between neurons/nodes was produced with the [[Hopfield net|Hopfield's network]] (1982), and specialization of these node layers for specific purposes was introduced through the first [[hybrid neural network|hybrid network]].
The [[connectionism|parallel distributed processing]] of the mid-1980s became popular under the name [[connectionism]].
The rediscovery of the [[backpropagation]] algorithm was probably the main reason behind the repopularisation of neural networks after the publication of "Learning Internal Representations by Error Propagation" in 1986 (though backpropagation itself dates from 1974). The original network utilised multiple layers of weighted-sum units of the type <math>f = g(w'x + b)</math>, where <math>g</math> was a [[sigmoid function]] or [[logistic function]] such as used in [[logistic regression]]. Training was done by a form of stochastic steepest gradient descent. The employment of the chain rule of differentiation in deriving the appropriate parameter updates results in an algorithm that seems to 'backpropagate errors', hence the nomenclature. However, it is essentially a form of gradient descent. Determining the optimal parameters in a model of this type is not trivial, and steepest gradient descent methods cannot be relied upon to give the solution without a good starting point. In recent times, networks with the same architecture as the backpropagation network are referred to as [[Multilayer perceptron|Multi-Layer Perceptrons]]. This name does not impose any limitations on the type of algorithm used for learning.
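The way the chain rule yields 'backpropagated' errors can be sketched on the smallest possible case: one sigmoid hidden unit feeding one sigmoid output unit, with a squared-error cost. The network size, parameter values, cost function and function names below are assumptions made for illustration; the 1986 paper's networks were larger. The analytic gradient is compared against a finite-difference estimate, and one gradient descent step is taken.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(params, x):
    """Two chained units of the type g(w*x + b), with g the sigmoid."""
    w1, b1, w2, b2 = params
    h = sigmoid(w1 * x + b1)  # hidden unit
    y = sigmoid(w2 * h + b2)  # output unit
    return h, y


def loss(params, x, target):
    _, y = forward(params, x)
    return 0.5 * (y - target) ** 2


def backprop(params, x, target):
    """Gradient of the loss w.r.t. [w1, b1, w2, b2] via the chain rule."""
    w1, b1, w2, b2 = params
    h, y = forward(params, x)
    delta_out = (y - target) * y * (1.0 - y)    # error at the output unit
    delta_hid = delta_out * w2 * h * (1.0 - h)  # error passed back to the hidden unit
    return [delta_hid * x, delta_hid, delta_out * h, delta_out]


def numeric_gradient(params, x, target, eps=1e-6):
    """Central finite differences, used only to check the analytic gradient."""
    grads = []
    for i in range(len(params)):
        up = list(params)
        up[i] += eps
        down = list(params)
        down[i] -= eps
        grads.append((loss(up, x, target) - loss(down, x, target)) / (2 * eps))
    return grads


params = [0.5, -0.3, 0.8, 0.1]  # arbitrary starting point
analytic = backprop(params, x=1.5, target=1.0)
numeric = numeric_gradient(params, x=1.5, target=1.0)
max_gap = max(abs(a - n) for a, n in zip(analytic, numeric))

# One steepest-descent step with an assumed learning rate of 0.1.
stepped = [p - 0.1 * g for p, g in zip(params, analytic)]
loss_before = loss(params, 1.5, 1.0)
loss_after = loss(stepped, 1.5, 1.0)
```

The two gradients agree to within numerical error, which is the sense in which backpropagation is "essentially a form of gradient descent": it is an efficient way of computing the same derivatives.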
The backpropagation network generated much enthusiasm at the time and there was much controversy about whether such learning could be implemented in the brain or not, partly because a mechanism for reverse signalling was not obvious at the time, but most importantly because there was no plausible source for the 'teaching' or 'target' signal.
==Criticism==
[[A. K. Dewdney]], a former ''[[Scientific American]]'' columnist, wrote in 1997, ''“Although neural nets do solve a few toy problems, their powers of computation are so limited that I am surprised anyone takes them seriously as a general problem-solving tool.”'' (Dewdney, p.82)
Arguments against Dewdney's position are that neural nets have been successfully used to solve many complex and diverse tasks, ranging from autonomously flying aircraft[http://www.nasa.gov/centers/dryden/news/NewsReleases/2003/03-49.html] to detecting credit card fraud[http://www.visa.ca/en/about/visabenefits/innovation.cfm].
Technology writer [[Roger Bridgman]] commented on Dewdney's statements about neural nets:
Neural networks, for instance, are in the dock not only because they have been hyped to high heaven, (what hasn't?) but also because you could create a successful net without understanding how it worked: the bunch of numbers that captures its behaviour would in all probability be "an opaque, unreadable table...valueless as a scientific resource".
In spite of his emphatic declaration that science is not technology, Dewdney seems here to pillory neural nets as bad science when most of those devising them are just trying to be good engineers. An unreadable table that a useful machine could read would still be well worth having.
N-gram
An '''''n''-gram''' is a sub-sequence of ''n'' items from a given [[sequence]]. ''n''-grams are used in various areas of statistical [[natural language processing]] and genetic sequence analysis. The items in question can be letters, words or [[base pairs]] according to the application.
An ''n''-gram of size 1 is a "[[unigram]]"; size 2 is a "[[bigram]]" (or, more etymologically sound but less commonly used, a "digram"); size 3 is a "[[trigram]]"; and size 4 or more is simply called an "''n''-gram". Some [[language model]]s built from n-grams are "(''n'' − 1)-order [[Markov_chain|Markov model]]s".
==Examples==
Here are examples of '''''word'''''-level 3-grams and 4-grams (with counts of the number of times they appeared) from the [[N-gram#Google_use_of_N-gram|Google n-gram corpus]].
3-grams
*ceramics collectables collectibles (55)
*ceramics collectables fine (130)
*ceramics collected by (52)
*ceramics collectible pottery (50)
*ceramics collectibles cooking (45)
4-grams
*serve as the incoming (92)
*serve as the incubator (99)
*serve as the independent (794)
*serve as the index (223)
*serve as the indication (72)
*serve as the indicator (120)
==''n''-gram models==
An '''''n''-gram model''' models sequences, notably natural languages, using the statistical properties of ''n''-grams.
This idea can be traced to an experiment in [[Claude Shannon]]'s work in [[information theory]]. His question was: given a sequence of letters (for example, the sequence "for ex"), what is the [[likelihood]] of the next letter? From training data, one can derive a [[probability distribution]] for the next letter given a history of size <math>n - 1</math>: ''a'' = 0.4, ''b'' = 0.00001, ''c'' = 0, ...; where the probabilities of all possible "next-letters" sum to 1.0.
More concisely, an ''n''-gram model predicts <math>x_i</math> based on <math>x_{i-(n-1)}, \dots, x_{i-1}</math>. In probability terms, this is <math>P(x_i \mid x_{i-(n-1)}, \dots, x_{i-1})</math>. When used for [[language model|language modeling]], independence assumptions are made so that each word depends only on the last ''n'' − 1 words. This [[Markov model]] is used as an approximation of the true underlying language. This assumption is important because it massively simplifies the problem of learning the language model from data. In addition, because of the open nature of language, it is common to group words unknown to the language model together.
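As a rough illustration, the conditional distribution can be estimated from training data by relative frequency. The sketch below assumes the common convention that an ''n''-gram model conditions on the preceding ''n'' − 1 tokens; the function name and the toy sentence are hypothetical.

```python
from collections import Counter, defaultdict


def train_ngram_model(tokens, n):
    """Estimate P(next token | previous n-1 tokens) by relative frequency."""
    counts = defaultdict(Counter)
    for i in range(n - 1, len(tokens)):
        history = tuple(tokens[i - (n - 1):i])  # the n-1 tokens before position i
        counts[history][tokens[i]] += 1
    model = {}
    for history, followers in counts.items():
        total = sum(followers.values())
        model[history] = {tok: c / total for tok, c in followers.items()}
    return model


# Hypothetical toy training data.
text = "the dog saw the cat and the dog ran".split()
bigram_model = train_ngram_model(text, n=2)
after_the = bigram_model[("the",)]  # distribution over words following "the"
```

Here "the" is followed by "dog" twice and "cat" once, so the estimated probabilities are 2/3 and 1/3, and they sum to 1 as required of a distribution over next tokens.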
''n''-gram models are widely used in statistical [[natural language processing]]. In [[speech recognition]], [[phonemes]] and sequences of phonemes are modeled using an ''n''-gram distribution. For parsing, words are modeled such that each ''n''-gram is composed of ''n'' words. For [[language recognition]], sequences of letters are modeled for different languages. For a sequence of words (for example "the dog smelled like a skunk"), the trigrams would be: "the dog smelled", "dog smelled like", "smelled like a", and "like a skunk". For sequences of characters, the 3-grams (sometimes referred to as "trigrams") that can be generated from "good morning" are "goo", "ood", "od ", "d m", " mo", "mor" and so forth. Some practitioners preprocess strings to remove spaces; most simply collapse whitespace to a single space while preserving paragraph marks. Punctuation is also commonly reduced or removed by preprocessing. ''n''-grams can also be used for sequences of words or, in fact, for almost any type of data. They have been used, for example, for extracting features for clustering large sets of satellite earth images and for determining what part of the Earth a particular image came from. They have also been very successful as the first pass in genetic sequence search and in the identification of which species short sequences of DNA were taken from.
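The word and character examples above can be reproduced with a short sketch. The helper name <code>ngrams</code> and the whitespace-collapsing step are illustrative assumptions, not a reference implementation.

```python
def ngrams(sequence, n):
    """Return every contiguous run of n items from a sequence, in order."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


# Word-level trigrams.
words = "the dog smelled like a skunk".split()
word_trigrams = [" ".join(gram) for gram in ngrams(words, 3)]

# Character-level 3-grams; the space counts as a character.
char_trigrams = ngrams("good morning", 3)

# One simple preprocessing choice: collapse runs of whitespace to one space.
normalized = " ".join("good   morning".split())
```

Because a string is itself a sequence of characters, the same helper serves both cases; only the choice of item (word or letter) changes.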
''n''-gram models are often criticized because they lack any explicit representation of long-range dependency. While it is true that the only explicit dependency range is (''n'' − 1) tokens for an ''n''-gram model, it is also true that the effective range of dependency is significantly longer than this, although long-range correlations drop exponentially with distance for any Markov model. Alternative Markov language models that incorporate some degree of local state can exhibit very long-range dependencies. This is often done using hand-crafted state variables that represent, for instance, the position in a sentence, the general topic of discourse or a grammatical state variable. Some of the best parsers of English currently in existence are roughly of this form.
Another criticism that has been leveled is that Markov models of language, including n-gram models, do not explicitly capture the performance/competence distinction introduced by [[Noam Chomsky]]. This criticism fails to explain why parsers that are the best at parsing text seem to uniformly lack any such distinction and most even lack any clear distinction between semantics and syntax. Most proponents of n-gram and related language models opt for a fairly pragmatic approach to language modeling that emphasizes empirical results over theoretical purity.
==''n''-grams for approximate matching==
''n''-grams can also be used for efficient approximate matching. By converting a sequence of items to a set of ''n''-grams, it can be embedded in a [[vector space]] (in other words, represented as a [[histogram]]), thus allowing the sequence to be compared to other sequences in an efficient manner. For example, if we convert strings with only letters in the English alphabet into 3-grams, we get a <math>26^3</math>-dimensional space (the first dimension measures the number of occurrences of "aaa", the second "aab", and so forth for all possible combinations of three letters). Using this representation, we lose information about the string. For example, both the strings "abcba" and "bcbab" give rise to exactly the same 2-grams. However, we know empirically that if two strings of real text have a similar vector representation (as measured by [[dot product|cosine distance]]) then they are likely to be similar. Other metrics have also been applied to vectors of ''n''-grams with varying, sometimes better, results. For example, [[z-score]]s have been used to compare documents by examining how many standard deviations each ''n''-gram differs from its mean occurrence in a large collection, or [[text corpus]], of documents (which form the "background" vector). In the event of small counts, the [[g-score]] may give better results for comparing alternative models.
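A minimal sketch of this representation stores only the non-zero dimensions of the histogram and compares two strings by the cosine of the angle between their count vectors. The function names and the sample strings other than "abcba" and "bcbab" are assumptions for illustration.

```python
import math
from collections import Counter


def ngram_counts(text, n):
    """Sparse n-gram histogram: maps each n-gram to its number of occurrences."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Size of the full 3-gram space over a 26-letter alphabet.
dimensions = 26 ** 3

# Information loss: two different strings with identical 2-gram histograms.
same_histogram = ngram_counts("abcba", 2) == ngram_counts("bcbab", 2)

# Hypothetical sample strings: similar text scores higher than unrelated text.
sim_close = cosine_similarity(ngram_counts("night watch", 3),
                              ngram_counts("night watchman", 3))
sim_far = cosine_similarity(ngram_counts("night watch", 3),
                            ngram_counts("day shift", 3))
```

Using a sparse counter rather than a dense array of 17,576 entries is a design choice for this sketch: real strings populate only a tiny fraction of the possible 3-grams.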
It is also possible to take a more principled approach to the statistics of ''n''-grams, modeling similarity as the likelihood that two strings came from the same source directly in terms of a problem in [[Bayesian inference]].
==Other applications==
''n''-grams find use in several areas of computer science, [[computational linguistics]], and applied mathematics.
They have been used to:
* design [[kernel (mathematics)|kernels]] that allow [[machine learning]] algorithms such as [[support vector machine]]s to learn from string data
* find likely candidates for the correct spelling of a misspelled word
* improve compression in [[data compression|compression algorithms]] where a small area of data requires ''n''-grams of greater length
* assess the probability of a given word sequence appearing in text of a language of interest in pattern recognition systems, [[speech recognition]], OCR ([[optical character recognition]]), [[Intelligent Character Recognition]] ([[ICR]]), [[machine translation]] and similar applications
* improve retrieval in [[information retrieval]] systems when it is hoped to find similar "documents" (a term for which the conventional meaning is sometimes stretched, depending on the data set) given a single query document and a database of reference documents
* improve retrieval performance in genetic sequence analysis as in the [[BLAST]] family of programs
* identify the language a text is in or the species a small sequence of DNA was taken from
* predict letters or words at random in order to create text, as in the [[dissociated press]] algorithm.
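The last application in the list, random text generation, can be sketched as follows. This is a simplified illustration in the spirit of such generators, not the dissociated press algorithm itself; the function names, the fixed random seed and the toy source text are assumptions.

```python
import random
from collections import defaultdict


def build_successors(words, n=2):
    """Map each (n-1)-word history to the list of words that followed it."""
    table = defaultdict(list)
    for i in range(len(words) - n + 1):
        history = tuple(words[i:i + n - 1])
        table[history].append(words[i + n - 1])
    return table


def generate(words, length, n=2, seed=0):
    """Emit up to `length` words, each drawn at random given the last n-1 words."""
    rng = random.Random(seed)  # seeded so that repeated runs give the same text
    table = build_successors(words, n)
    out = list(words[:n - 1])  # start from the opening words of the source
    while len(out) < length:
        history = tuple(out[len(out) - (n - 1):])
        followers = table.get(history)
        if not followers:  # dead end: this history was never continued
            break
        out.append(rng.choice(followers))
    return out


# Hypothetical source text.
source = "the dog saw the cat and the cat saw the dog".split()
generated = generate(source, 8)
```

Every adjacent pair of words in the output also occurs somewhere in the source, which is why text produced this way tends to look locally plausible while drifting globally.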
== Bias-versus-variance trade-off ==
What goes into picking the ''n'' for the ''n''-gram?
One difficulty is balancing the weight given to ''infrequent grams'' (for example, a proper name that happened to appear in the training data) against that given to ''frequent grams''. Also, items not seen in the training data will be given a [[probability]] of 0.0 without [[smoothing]]. For unseen but plausible data from a sample, one can introduce [[pseudocount]]s. Pseudocounts are generally motivated on Bayesian grounds.
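The effect of pseudocounts can be seen in a small sketch in which every item of a known vocabulary receives the same extra count before normalisation. The function name, the three-letter vocabulary and the pseudocount of 1 are assumptions for illustration; the sketch also assumes every observation belongs to the vocabulary.

```python
from collections import Counter


def smoothed_probabilities(observations, vocabulary, pseudocount=1.0):
    """Relative frequencies after adding `pseudocount` to every item's count."""
    counts = Counter(observations)
    total = len(observations) + pseudocount * len(vocabulary)
    return {item: (counts[item] + pseudocount) / total for item in vocabulary}


vocab = ["a", "b", "c"]
data = ["a", "a", "b"]  # "c" is plausible but never observed

unsmoothed = smoothed_probabilities(data, vocab, pseudocount=0.0)
smoothed = smoothed_probabilities(data, vocab, pseudocount=1.0)
```

Without smoothing the unseen item "c" gets probability 0; with a pseudocount of 1 it gets 1/6, at the cost of lowering the estimates for the observed items.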
=== Smoothing techniques ===
* [[Linear interpolation]] (e.g., taking the [[weighted mean]] of the unigram, bigram, and trigram)
* [[Good-Turing]] discounting
* [[Witten-Bell discounting]]
* [[Katz's back-off model]] (trigram)
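Of the techniques listed, linear interpolation is the simplest to sketch: the probability of a word is a weighted mean of its unigram, bigram and trigram relative-frequency estimates. The weights (0.1, 0.3, 0.6), the function name and the toy corpus below are arbitrary assumptions; in practice the weights would be tuned on held-out data.

```python
from collections import Counter


def interpolated_probability(tokens, w1, w2, w3, weights=(0.1, 0.3, 0.6)):
    """P(w3 | w1 w2) as a weighted mean of unigram, bigram and trigram estimates."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
    # Histories are counted only where a following token exists.
    history1 = Counter(tokens[:-1])
    history2 = Counter(zip(tokens[:-2], tokens[1:-1]))

    p_uni = unigrams[w3] / len(tokens)
    p_bi = bigrams[(w2, w3)] / history1[w2] if history1[w2] else 0.0
    p_tri = trigrams[(w1, w2, w3)] / history2[(w1, w2)] if history2[(w1, w2)] else 0.0

    l1, l2, l3 = weights  # assumed to sum to 1
    return l1 * p_uni + l2 * p_bi + l3 * p_tri


# Hypothetical toy corpus.
corpus = "the dog saw the cat and the dog ran".split()
p_seen = interpolated_probability(corpus, "the", "dog", "ran")
p_unseen = interpolated_probability(corpus, "the", "cat", "ran")
```

The trigram "the cat ran" never occurs in the toy corpus, yet it still receives a small non-zero probability from the unigram term, which is the purpose of smoothing.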
==Google use of N-gram==
[[Google]] uses n-gram models for a variety of R&D projects, such as [[statistical machine translation]], [[speech recognition]], [[Spell checker|checking spelling]], [[entity detection]], and [[information extraction|data mining]]. In September of 2006 [http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html Google announced] that they made their n-grams [http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13 public] at the [[Linguistic Data Consortium]] ([http://www.ldc.upenn.edu/ LDC]).
Noun
In [[linguistics]], a '''noun''' is a member of a large, [[open class (linguistics)|open]] [[lexical category]] whose members can occur as the main word in the [[subject (grammar)|subject]] of a [[clause]], the [[object (grammar)|object]] of a [[verb]], or the object of a [[preposition]].
Lexical categories are defined in terms of how their members combine with other kinds of expressions. The syntactic rules for nouns differ from language to language. In [[English language|English]], nouns may be defined as those words which can occur with articles and [[adjective|attributive adjectives]] and can function as the [[phrase|head]] of a [[noun phrase]].
In [[traditional grammar|traditional]] English grammar, the noun is one of the eight [[parts of speech]].
==History==
The word comes from the [[Latin]] ''nomen'' meaning "[[name]]". Word classes like nouns were first described by the Sanskrit grammarian [[Panini (grammarian)|{{IAST|Pāṇini}}]] and ancient Greeks like [[Dionysios Thrax]]; and were defined in terms of their [[morphology (linguistics)|morphological]] properties. For example, in Ancient Greek, nouns inflect for [[case (grammar)|grammatical case]], such as dative or accusative. [[Verb]]s, on the other hand, inflect for [[grammatical tense|tenses]], such as past, present or future, while nouns do not. [[Aristotle]] also had a notion of ''onomata'' (nouns) and ''rhemata'' (verbs) which, however, does not exactly correspond with modern notions of nouns and verbs.
Vinokurova 2005 has a more detailed discussion of the historical origin of the notion of a noun.
==Different definitions of nouns==
Expressions of [[natural language]] have properties at different levels. They have ''formal'' properties, like what kinds of [[morphology (linguistics)|morphological]] [[prefix]]es or [[suffix]]es they take and what kinds of other expressions they combine with; but they also have [[semantics|semantic]] properties, i.e. properties pertaining to their meaning. The definition of a noun at the outset of this page is thus a ''formal'', traditional grammatical definition. That definition is, for the most part, considered uncontroversial, and it allows language users to distinguish most nouns from non-nouns effectively. However, it has the disadvantage that it does not apply to nouns in all languages. For example, in [[Russian language|Russian]], there are no definite articles, so one cannot define nouns as words that are modified by definite articles. There have also been several attempts to define nouns in terms of their [[semantics|semantic]] properties. Many of these are controversial, but some are discussed below.
===Names for things===
In [[Traditional grammar|traditional school grammars]], one often encounters the definition of nouns that they are all and only those expressions that refer to a ''person'', ''place'', ''thing'', ''event'', ''substance'', ''quality'', or ''idea'', etc. This is a ''semantic'' definition. It has been criticized by contemporary linguists as being uninformative. Contemporary linguists generally agree that one cannot successfully define nouns (or other grammatical categories) in terms of what sort of ''object in the world'' they ''[[reference|refer]] to'' or ''[[signification|signify]]''. Part of the [[conundrum]] is that the definition makes use of relatively ''general'' nouns ("thing", "phenomenon", "event") to define what nouns ''are''. The existence of such ''general'' nouns demonstrates that nouns refer to entities that are organized in [[taxonomy|taxonomic]] [[hierarchies]]. But other kinds of expressions are also organized into such structured taxonomic relationships. For example, the verbs "stroll", "saunter", "stride", and "tread" are more specific words than the more ''general'' "walk". Moreover, "walk" is more specific than the verb "move", which, in turn, is less general than "change". But it is unlikely that such taxonomic relationships can be used to ''define'' nouns and verbs. We cannot ''define'' verbs as those words that refer to "changes" or "states", for example, because the nouns ''change'' and ''state'' probably refer to such things, but, of course, aren't verbs. Similarly, nouns like "invasion", "meeting", or "collapse" refer to things that are "done" or "happen". In fact, an influential [[theory]] has it that verbs like "kill" or "die" refer to events, which are among the sorts of things that nouns are supposed to refer to.
The point being made here is not that this view of verbs is wrong, but rather that this property of verbs is a poor basis for a ''definition'' of this category, just as the property of ''having wheels'' is a poor basis for a definition of cars (some things that have wheels, such as my suitcase or a jumbo jet, aren't cars). Similarly, adjectives like "yellow" or "difficult" might be thought to refer to qualities, and adverbs like "outside" or "upstairs" seem to refer to places, which are also among the sorts of things nouns can refer to. But verbs, adjectives and adverbs are not nouns, and nouns aren't verbs, adjectives or adverbs. One might argue that "definitions" of this sort really rely on speakers' prior intuitive knowledge of what nouns, verbs and adjectives are, and so don't really add anything over and beyond this. Speakers' intuitive knowledge of such things might plausibly be based on ''formal'' criteria, such as the traditional grammatical definition of English nouns given above.
===Prototypically referential expressions===
Another semantic definition of nouns is that they are ''prototypically referential.'' That definition is also not very helpful in distinguishing actual nouns from verbs. But it may still correctly identify a core property of nounhood. For example, we will tend to use nouns like "fool" and "car" when we wish to refer to fools and cars, respectively. The notion that this is '''prototypical''' reflects the fact that such nouns can be used, even though nothing with the corresponding property is referred to:
:John is no '''fool'''.
:If I had a '''car''', I'd go to Marrakech.
The first sentence above doesn't refer to any fools, nor does the second one refer to any particular car.
===Predicates with identity criteria===
The British logician [[Peter Thomas Geach]] proposed a very subtle semantic definition of nouns. He noticed that adjectives like "same" can modify nouns, but not other parts of speech, like [[verbs]] or [[adjectives]]. Not only that, but there also don't seem to be any ''other'' expressions with similar meaning that can modify verbs and adjectives. Consider the following examples.
: Good: John and Bill participated in the '''same''' fight.
: Bad: *John and Bill '''samely''' fought.
There is no English adverb "samely". In some other languages, like Czech, however, there are adverbs corresponding to "samely". Hence, in Czech, the translation of the last sentence would be fine; however, it would mean that John and Bill fought ''in the same way'', not that they participated in the ''same fight''. Geach proposed that we could explain this if nouns denote logical [[predicate (grammar)|predicate]]s with '''identity criteria'''. An identity criterion would allow us to conclude, for example, that "person x at time 1 is ''the same person'' as person y at time 2". Different nouns can have different identity criteria. A well-known example of this is due to Gupta:
:National Airlines transported 2 million '''passengers''' in 1979.
:National Airlines transported (at least) 2 million '''persons''' in 1979.
Given that, in general, all passengers are persons, the last sentence above ought to follow logically from the first one. But it doesn't. It is easy to imagine, for example, that on average, every person who travelled with National Airlines in 1979, travelled with them twice. In that case, one would say that the airline transported 2 million ''passengers'' but only 1 million ''persons''. Thus, the way that we count ''passengers'' isn't necessarily the same as the way that we count ''persons''. Put somewhat differently: At two different times, ''you'' may correspond to two distinct ''passengers'', even though you are one and the same person. For a precise definition of ''identity criteria'', see Gupta.
Recently, Baker has proposed that Geach's definition of nouns in terms of identity criteria allows us to ''explain'' the characteristic properties of nouns. He argues that nouns can co-occur with (in-)definite articles and numerals, and are "prototypically referential" ''because'' they are all and only those [[parts of speech]] that provide identity criteria. Baker's proposals are quite new, and linguists are still evaluating them.
==Classification of nouns in English==
===Proper nouns and common nouns===
''Proper nouns'' (also called ''[[proper name]]s'') are nouns representing unique entities (such as ''London'', ''Universe'' or ''John''), as distinguished from common nouns which describe a class of entities (such as ''city'', ''planet'' or ''person'').
In [[English language|English]] and most other languages that use the [[Latin alphabet]], proper nouns are usually [[capitalization|capitalized]]. Languages differ in whether most elements of multiword proper nouns are capitalised (e.g., American English ''House of Representatives'') or only the initial element (e.g., Slovenian ''Državni zbor'' 'National Assembly'). In [[German language|German]], nouns of all types are capitalized. The convention of capitalizing ''all'' nouns was previously used in English, but ended circa 1800. In America, the shift in capitalization is recorded in several noteworthy documents. The end (but not the beginning) of the [[United States Declaration of Independence#Annotated text of the Declaration|Declaration of Independence]] (1776) and all of the [[United States Constitution|Constitution]] (1787) show nearly all nouns capitalized, the [[United States Bill of Rights#Text of the Bill of Rights|Bill of Rights]] (1789) capitalizes a few common nouns but not most of them, and the [[Thirteenth Amendment to the United States Constitution|Thirteenth Constitutional Amendment]] (1865) only capitalizes proper nouns.
Sometimes the same word can function as both a common noun and a proper noun, where one such entity is special. For example the common noun ''god'' denotes all deities, while the proper noun ''God'' references the [[monotheism|monotheistic]] [[God]] specifically.
Owing to the essentially arbitrary nature of [[Orthography|orthographic]] classification and the existence of variant authorities and adopted [[Style guide|''house styles'']], questionable capitalization of words is not uncommon, even in respected newspapers and magazines. Most publishers, however, properly require ''consistency'', at least within the same document, in applying their specified standard.
The common meaning of the word or words constituting a proper noun may be unrelated to the object to which the proper noun refers. For example, someone might be named "Tiger Smith" despite being neither a [[tiger]] nor a [[smith (metalwork)|smith]]. For this reason, proper nouns are usually not [[translation|translated]] between languages, although they may be [[transliteration|transliterated]]. For example, the German surname ''Knödel'' becomes ''Knodel'' or ''Knoedel'' in English (not the literal ''Dumpling''). However, the [[Transliteration|transcription]] of place names and the names of [[monarch]]s, [[pope]]s, and non-contemporary [[author]]s is common and sometimes universal. For instance, the [[Portuguese language|Portuguese]] word ''Lisboa'' becomes ''[[Lisbon]]'' in [[English language|English]]; the English ''London'' becomes ''Londres'' in French; and the [[ancient Greek|Greek]] ''Aristotelēs'' becomes [[Aristotle]] in English.
===Countable and uncountable nouns===
''Count nouns'' are common nouns that can take a [[plural]], can combine with [[numerals]] or [[quantifiers]] (e.g. "one", "two", "several", "every", "most"), and can take an indefinite article ("a" or "an"). Examples of count nouns are "chair", "nose", and "occasion".
''Mass nouns'' (or ''non-count nouns'') differ from count nouns in precisely that respect: they can't take a plural or combine with number words or quantifiers. Examples from English include "laughter", "cutlery", "helium", and "furniture". For example, it is not possible to refer to "a furniture" or "three furnitures". This is true even though the pieces of furniture comprising "furniture" could be counted. Thus the distinction between mass and count nouns shouldn't be made in terms of what sorts of things the nouns ''refer'' to, but rather in terms of how the nouns ''present'' these entities.
===Collective nouns===
''Collective nouns'' are nouns that refer to ''groups'' consisting of more than one individual or entity, even when they are inflected for the [[Grammatical number|singular]]. Examples include "committee", "herd", and "school" (of herring). These nouns have slightly different grammatical properties than other nouns. For example, the [[noun phrases]] that they [[head (syntax)|head]] can serve as the [[subject (grammar)|subject]] of a [[collective predicate]], even when they are inflected for the singular. A [[collective predicate]] is a predicate that normally can't take a singular subject. An example of the latter is "talked to each other".
:Good: The '''boys''' talked to each other.
:Bad: *The '''boy''' talked to each other.
:Good: The '''committee''' talked to each other.
===Concrete nouns and abstract nouns===
''Concrete nouns'' refer to [[physical bodies]] that can be observed with at least one of the [[sense]]s, for instance "chair", "apple", or "Janet". ''Abstract nouns'', on the other hand, refer to [[abstract object]]s, that is, ideas or concepts, such as "justice" or "hate". While this distinction is sometimes useful, the boundary between the two is not always clear; consider, for example, the noun "art". In English, many abstract nouns are formed by adding noun-forming suffixes ("-ness", "-ity", "-tion") to adjectives or verbs. Examples are "happiness", "circulation" and "serenity".
==Nouns and pronouns==
[[Noun phrase]]s can typically be replaced by [[pronoun]]s, such as "he", "it", "which", and "those", in order to avoid repetition or explicit identification, or for other reasons. For example, in the sentence "Janet thought that he was weird", the word "he" is a pronoun standing in place of the name of the person in question. The English word ''one'' can replace parts of [[noun phrase]]s, and it sometimes stands in for a noun. An example is given below:
: John's car is newer than ''the one'' that Bill has.
But ''one'' can also stand in for bigger subparts of a noun phrase. In the following example, ''one'' can stand in for ''new car''.
: This new car is cheaper than ''that one''.
==Substantive as a word for "noun"==
Starting with old [[Latin language|Latin]] grammars, many European languages use some form of the word ''substantive'' as the basic term for noun. Nouns in the dictionaries of such languages are marked by the abbreviation "s" instead of "n", which may be used for proper nouns instead. This corresponds to those grammars in which nouns and adjectives shade into each other in more areas than, for example, the English term [[Predicative_adjective#Predicative_adjective|predicate adjective]] entails. In French and Spanish, for example, adjectives frequently act as nouns referring to people who have the characteristics of the adjective. An example in English is:
: The ''poor'' you have always with you.
Similarly, an adjective can also be used for a whole group or organization of people:
: The Socialist ''International''.
Hence, these words are substantives that are usually adjectives in English.
Ontology (information science)
In both [[computer science]] and [[information science]], an '''ontology''' is a formal representation of a set of concepts within a [[Domain of discourse|domain]] and the relationships between those concepts. It is used to [[Reasoning|reason]] about the properties of that domain, and may be used to define the domain.
Ontologies are used in [[artificial intelligence]], the [[Semantic Web]], [[software engineering]], [[biomedical informatics]], [[library science]], and [[information architecture]] as a form of [[knowledge representation]] about the world or some part of it. Common components of ontologies include:
* Individuals: instances or objects (the basic or "ground level" objects)
* [[Class]]es: [[set (computer science)|set]]s, collections, concepts or types of objects
* [[Attribute (computing)|Attribute]]s: properties, features, characteristics, or parameters that objects (and classes) can have
* [[Relation (mathematics)|Relations]]: ways that classes and objects can be related to one another
* Function terms: complex structures formed from certain relations that can be used in place of an individual term in a statement
* Restrictions: formally stated descriptions of what must be true in order for some assertion to be accepted as input
* Rules: statements in the form of an if-then (antecedent-consequent) sentence that describe the logical inferences that can be drawn from an assertion in a particular form
* Axioms: assertions (including rules) in a logical form that together comprise the overall theory that the ontology describes in its domain of application. This definition differs from that of "axioms" in generative grammar and formal logic. In these disciplines, axioms include only statements asserted as ''a priori'' knowledge. As used here, "axioms" also include the theory derived from axiomatic statements.
* [[Event (philosophy)|Events]]: the changing of attributes or relations
Ontologies are commonly encoded using [[ontology language]]s.
== Elements ==
Contemporary ontologies share many structural similarities, regardless of the language in which they are expressed. As mentioned above, most ontologies describe individuals (instances), classes (concepts), attributes, and relations. In this section each of these components is discussed in turn.
=== Individuals ===
Individuals (instances) are the basic, "ground level" components of an ontology. The individuals in an ontology may include concrete objects such as people, animals, tables, automobiles, molecules, and planets, as well as abstract individuals such as numbers and words. Strictly speaking, an ontology need not include any individuals, but one of the general purposes of an ontology is to provide a means of classifying individuals, even if those individuals are not explicitly part of the ontology.
In formal extensional ontologies, only the utterances of words and numbers are considered individuals – the numbers and names themselves are classes. In a 4D ontology, an individual is identified by its spatio-temporal extent. Examples of formal extensional ontologies are [[ISO 15926]] and the model in development by the [[IDEAS Group]].
=== Classes ===
Classes – concepts that are also called ''type'', ''sort'', ''category'', and ''kind'' – are abstract groups, sets, or collections of objects. They may contain individuals, other classes, or a combination of both. Some examples of classes:
* ''Person'', the class of all people
* ''Vehicle'', the class of all vehicles
* ''Car'', the class of all cars
* ''Class'', representing the class of all classes
* ''Thing'', representing the class of all things
Ontologies vary on whether classes can contain other classes, whether a class can belong to itself, whether there is a universal class (that is, a class containing everything), etc. Sometimes restrictions along these lines are made in order to avoid certain well-known [[paradox]]es.
The classes of an ontology may be [[extensional]] or [[intensional]] in nature. A class is extensional if and only if it is characterized solely by its membership. More precisely, a class C is extensional if and only if for any class C', if C' has exactly the same members as C, then C and C' are identical. If a class does not satisfy this condition, then it is intensional. While extensional classes are better behaved and better understood mathematically, as well as less problematic philosophically, they do not permit the fine-grained distinctions that ontologies often need to make. For example, an ontology may want to distinguish between the class of all creatures with a kidney and the class of all creatures with a heart, even if these classes happen to have exactly the same members. In the upper ontologies discussed later in this article, the classes are defined intensionally. Intensionally defined classes usually have necessary conditions associated with membership in each class. Some classes may also have sufficient conditions, and in those cases the combination of necessary and sufficient conditions makes that class a fully ''defined'' class.
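The kidney and heart example can be sketched in a few lines of Python. This is only an illustrative model, not a fragment of any ontology language: the names <code>ExtensionalClass</code>, <code>IntensionalClass</code> and <code>CREATURES</code> are hypothetical, and the toy data are assumptions made for the example.

```python
class ExtensionalClass:
    """A class identified solely by its members."""

    def __init__(self, name, members):
        self.name = name
        self.members = frozenset(members)

    def __eq__(self, other):
        # Two extensional classes are identical when their members coincide.
        return isinstance(other, ExtensionalClass) and self.members == other.members

    def __hash__(self):
        return hash(self.members)


class IntensionalClass:
    """A class identified by its defining condition, not by its members."""

    def __init__(self, name, condition):
        self.name = name
        self.condition = condition  # membership test applied to an individual

    def members(self, universe):
        return frozenset(x for x in universe if self.condition(x))


# Hypothetical toy universe: creature name -> organs it has.
CREATURES = {
    "dog": {"heart": True, "kidney": True},
    "cat": {"heart": True, "kidney": True},
    "jellyfish": {"heart": False, "kidney": False},
}

with_heart = IntensionalClass("HasHeart", lambda c: CREATURES[c]["heart"])
with_kidney = IntensionalClass("HasKidney", lambda c: CREATURES[c]["kidney"])

# Same members, yet the intensional classes remain distinct objects.
same_members = with_heart.members(CREATURES) == with_kidney.members(CREATURES)

# Treated extensionally, the two classes collapse into one.
ext_heart = ExtensionalClass("HasHeart", with_heart.members(CREATURES))
ext_kidney = ExtensionalClass("HasKidney", with_kidney.members(CREATURES))
```

In this sketch <code>ext_heart == ext_kidney</code> holds, while <code>with_heart</code> and <code>with_kidney</code> stay unequal even though they select the same individuals.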
Importantly, a class can subsume or be subsumed by other classes; a class subsumed by another is called a ''subclass'' of the subsuming class. For example, ''Vehicle'' subsumes ''Car'', since (necessarily) anything that is a member of the latter class is a member of the former. The subsumption relation is used to create a hierarchy of classes, typically with a maximally general class like ''Thing'' at the top, and very specific classes like ''2002 Ford Explorer'' at the bottom. The critically important consequence of the subsumption relation is the inheritance of properties from the parent (subsuming) class to the child (subsumed) class. Thus, anything that is necessarily true of a parent class is also necessarily true of all of its subsumed child classes. In some ontologies, a class is only allowed to have one parent (''single inheritance''), but in most ontologies, classes are allowed to have any number of parents (''multiple inheritance''), and in the latter case all necessary properties of each parent are inherited by the subsumed child class. Thus a particular class of animal (''HouseCat'') may be a child of the class ''Cat'' and also a child of the class ''Pet''.
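Subsumption and the inheritance of necessary properties can be illustrated with a small Python sketch. The hierarchy, the property strings and the function names below are all hypothetical choices made for the example; real ontology languages express the same idea with their own constructs.

```python
# Hypothetical toy hierarchy: each class lists its direct parents.
PARENTS = {
    "Thing": [],
    "Animal": ["Thing"],
    "Cat": ["Animal"],
    "Pet": ["Thing"],
    "HouseCat": ["Cat", "Pet"],  # multiple inheritance
}

# Properties asserted directly on each class (illustrative only).
OWN_PROPERTIES = {
    "Thing": set(),
    "Animal": {"is alive"},
    "Cat": {"has whiskers"},
    "Pet": {"has an owner"},
    "HouseCat": set(),
}


def ancestors(cls):
    """Return every class that subsumes `cls`, excluding `cls` itself."""
    found = set()
    stack = list(PARENTS[cls])
    while stack:
        parent = stack.pop()
        if parent not in found:
            found.add(parent)
            stack.extend(PARENTS[parent])
    return found


def subsumes(general, specific):
    """True if `general` is `specific` or one of its ancestors."""
    return general == specific or general in ancestors(specific)


def inherited_properties(cls):
    """Collect the properties of `cls` and of every class that subsumes it."""
    props = set(OWN_PROPERTIES[cls])
    for parent in ancestors(cls):
        props |= OWN_PROPERTIES[parent]
    return props
```

Here <code>inherited_properties("HouseCat")</code> gathers the properties of both ''Cat'' and ''Pet'', as well as those of their own ancestors, which mirrors the multiple-inheritance case described above.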
A partition is a set of related classes and associated rules that allow objects to be placed into the appropriate class. For example, an ontology may partition the ''Car'' class into the classes ''2-Wheel Drive'' and ''4-Wheel Drive''. The partition rule determines whether a particular car is placed in the ''2-Wheel Drive'' or the ''4-Wheel Drive'' class.
If the partition rule(s) guarantee that a single ''Car'' cannot be in both classes, then the partition is called a disjoint partition. If the partition rules ensure that every concrete object in the super-class is an instance of at least one of the partition classes, then the partition is called an exhaustive partition.
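The two conditions can be checked mechanically for a given set of objects. The following Python sketch assumes a hypothetical representation in which each partition class is paired with a rule (a predicate over cars). It only tests the objects supplied, so it illustrates the definitions rather than proving that a rule set is disjoint or exhaustive in general.

```python
def is_disjoint(partition, objects):
    """True if no object satisfies the rule of more than one partition class."""
    return all(
        sum(1 for rule in partition.values() if rule(obj)) <= 1
        for obj in objects
    )


def is_exhaustive(partition, objects):
    """True if every object satisfies the rule of at least one partition class."""
    return all(
        any(rule(obj) for rule in partition.values())
        for obj in objects
    )


# Hypothetical cars and partition rules for the Car class.
cars = [
    {"model": "A", "driven_wheels": 2},
    {"model": "B", "driven_wheels": 4},
]
car_partition = {
    "2-Wheel Drive": lambda car: car["driven_wheels"] == 2,
    "4-Wheel Drive": lambda car: car["driven_wheels"] == 4,
}

# A six-wheel-drive vehicle fits neither class, so the partition
# is no longer exhaustive once it is added.
cars_with_outlier = cars + [{"model": "C", "driven_wheels": 6}]
```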
=== Attributes ===
Objects in the ontology can be described by assigning attributes to them. Each attribute has at least a name and a value, and is used to store information that is specific to the object it is attached to. For example, the Ford Explorer object has attributes such as:
* ''Name'': Ford Explorer
* ''Number-of-doors'': 4
* ''Engine'': {4.0L, 4.6L}
* ''Transmission'': 6-speed
The value of an attribute can be a complex [[data type]]; in this example, the value of the attribute called ''Engine'' is a list of values, not just a single value.
If no attributes are defined for the concepts, the result is either a [[taxonomy]] (if [[hyponym]] relationships exist between concepts) or a '''controlled vocabulary'''. These are useful, but are not considered true ontologies.
===Relationships===
An important use of attributes is to describe the relationships (also known as relations) between objects in the ontology. Typically a relation is an attribute whose value is another object in the ontology. For example, in an ontology that contains the Ford Explorer and the [[Ford Bronco]], the Ford Bronco object might have the following attribute:
* ''Successor'': Ford Explorer
This tells us that the Explorer is the model that replaced the Bronco. Much of the power of ontologies comes from the ability to describe these relations. Together, the set of relations describes the [[semantics]] of the domain.
The most important type of relation is the [[subsumption]] relation (''is-[[superclass]]-of'', the converse of ''[[is-a]]'', ''is-subtype-of'' or ''is-[[subclass]]-of''). This defines which objects are members of classes of objects. For example, the Ford Explorer ''is-a'' 4-wheel drive, which in turn ''is-a'' Car.
The addition of is-a relationships creates a hierarchical [[taxonomy]]: a tree-like structure (or, more generally, a [[partially ordered set]]) that clearly depicts how objects relate to one another. In such a structure, each object is the 'child' of a 'parent class'. (Some languages restrict the is-a relationship to one parent for all nodes, but many do not.)
Another common type of relation is the [[meronymy]] relation, written as ''part-of'', which represents how objects combine to form composite objects. For example, if we extended our example ontology to include objects like Steering Wheel, we would say that "Steering Wheel is-part-of Ford Explorer", since a steering wheel is one of the components of a Ford Explorer. If we introduce meronymy relationships to our ontology, we find that this simple and elegant tree structure quickly becomes complex and significantly more difficult to interpret manually. It is not difficult to understand why: an entity that is described as 'part of' another entity might also be 'part of' a third entity. Consequently, entities may have more than one parent. The structure that emerges is known as a [[directed acyclic graph]] (DAG).
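A minimal Python sketch of such a structure follows. The <code>PART_OF</code> mapping is a hypothetical example (the parts and vehicles listed are illustrative assumptions), and <code>is_acyclic</code> is an ordinary depth-first search used here to confirm that the part-of links never loop back on themselves.

```python
# Hypothetical part-of links: each entity maps to the entities it is part of.
PART_OF = {
    "Steering Wheel": ["Ford Explorer", "Ford Bronco"],  # two parents
    "Airbag": ["Steering Wheel"],
    "Ford Explorer": [],
    "Ford Bronco": [],
}


def is_acyclic(graph):
    """Return True if no chain of links in `graph` leads back to its start."""
    unvisited, in_progress, done = 0, 1, 2
    state = {node: unvisited for node in graph}

    def visit(node):
        if state[node] == in_progress:
            return False  # reached a node already on the current path: a cycle
        if state[node] == done:
            return True
        state[node] = in_progress
        ok = all(visit(parent) for parent in graph[node])
        state[node] = done
        return ok

    return all(visit(node) for node in graph)
```

Because "Steering Wheel" has two parents, the structure is no longer a tree, but <code>is_acyclic(PART_OF)</code> still holds, which is what makes it a directed acyclic graph.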
As well as the standard is-a and part-of relations, ontologies often include additional types of relation that further refine the semantics they model. These relations are often domain-specific and are used to answer particular types of question.
For example, in the domain of automobiles, we might define a ''made-in'' relationship which tells us where each car is built. So the Ford Explorer is ''made-in'' [[Louisville, Kentucky|Louisville]]. The ontology may also know that Louisville is-in [[Kentucky]] and Kentucky is-a state of the [[United States|USA]]. Software using this ontology could now answer a question like "which cars are made in the U.S.?"
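The kind of inference involved can be sketched in Python by storing relations as (subject, relation, object) triples and following the location links transitively. This is an illustrative toy, not how any particular ontology system works: the triple store, the function names, and the second car and its locations are hypothetical, and the "Kentucky is-a state of the USA" statement is simplified to an ''is-in'' link for the purposes of the example.

```python
# Relations stored as (subject, relation, object) triples.
TRIPLES = {
    ("Ford Explorer", "made-in", "Louisville"),
    ("Louisville", "is-in", "Kentucky"),
    ("Kentucky", "is-in", "USA"),            # simplification of "is-a state of"
    ("Example Car X", "made-in", "Example Town"),  # hypothetical data
    ("Example Town", "is-in", "Elsewhere"),        # hypothetical data
}


def located_in(place, region):
    """Follow is-in links transitively, starting from `place`."""
    seen = set()
    frontier = [place]
    while frontier:
        current = frontier.pop()
        if current == region:
            return True
        if current in seen:
            continue
        seen.add(current)
        frontier.extend(o for s, p, o in TRIPLES if s == current and p == "is-in")
    return False


def cars_made_in(region):
    """Answer 'which cars are made in <region>?' from the stored triples."""
    return sorted(
        s for s, p, o in TRIPLES
        if p == "made-in" and located_in(o, region)
    )
```

With these triples, <code>cars_made_in("USA")</code> returns only the Ford Explorer, even though no triple states directly that the Explorer is made in the USA.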
== Domain ontologies and upper ontologies ==
A domain ontology (or domain-specific ontology) models a specific domain, or part of the world. It represents the particular meanings of terms as they apply to that domain. For example, the word ''[[card]]'' has many different meanings. An ontology about the domain of [[poker]] would model the "[[playing card]]" meaning of the word, while an ontology about the domain of [[computer hardware]] would model the "[[punch card]]" and "[[video card]]" meanings.
An [[Upper ontology (computer science)|upper ontology]] (or foundation ontology) is a model of the common objects that are generally applicable across a wide range of domain ontologies. It contains a [[core glossary]] in whose terms objects in a set of domains can be described. There are several standardized upper ontologies available for use, including [[Dublin Core]], [[General Formal Ontology|GFO]], [[OpenCyc]]/[[ResearchCyc]], [[Suggested Upper Merged Ontology|SUMO]], and [http://www.loa-cnr.it/DOLCE.html DOLCE]. [[WordNet]], while considered an upper ontology by some, is not an ontology: it is a unique combination of a [[taxonomy]] and a controlled vocabulary (see above, under Attributes).
The [[Gellish]] ontology is an example of a combination of an upper and a domain ontology.
Since domain ontologies represent concepts in very specific and often eclectic ways, they are often incompatible. As systems that rely on domain ontologies expand, they often need to merge domain ontologies into a more general representation. This presents a challenge to the ontology designer. Different ontologies in the same domain can also arise from different perceptions of the domain, based on cultural background, education or ideology, or because a different representation language was chosen.
At present, merging ontologies is a largely manual process and therefore time-consuming and expensive. Using a foundation ontology to provide a common definition of core terms can make this process manageable. There are studies on generalized techniques for merging ontologies, but this area of research is still largely theoretical.
== Ontology languages ==
An [[ontology language]] is a [[formal language]] used to encode the ontology. There are a number of such languages for ontologies, both proprietary and standards-based:
* [[Web Ontology Language|OWL]] is a language for making ontological statements, developed as a follow-on from [[Resource Description Framework|RDF]] and [[RDFS]], as well as earlier ontology language projects including [[Ontology Inference Layer|OIL]], [[DARPA Agent Markup Language|DAML]] and [[DAMLplusOIL|DAML+OIL]]. OWL is intended to be used over the [[World Wide Web]], and all its elements (classes, properties and individuals) are defined as RDF [[resource (Web)|resources]], and identified by [[Uniform Resource Identifier|URI]]s.
* [[KIF]] is a syntax for [[first-order logic]] that is based on [[S-expression]]s.
* The [[Cyc]] project has its own ontology language called [[CycL]], based on [[first-order predicate calculus]] with some higher-order extensions.
* [[Rule Interchange Format]] (RIF) and [[F-Logic]] combine ontologies and rules.
* The [[Gellish]] language includes rules for its own extension and thus integrates an ontology with an ontology language.
== Relation to the philosophical term ==
The term ''ontology'' has its origin in [[ontology|philosophy]], where it is the name of one fundamental branch of [[metaphysics]], concerned with analyzing various types or modes of ''existence'', often with special attention to the relations between particulars and universals, between intrinsic and extrinsic properties, and between essence and existence. According to [[Tom Gruber]] at [[Stanford University]], the meaning of ''ontology'' in the context of computer science is “a description of the concepts and relationships that can exist for an [[Software agent|agent]] or a community of agents.” He goes on to specify that an ontology is generally written, “as a set of definitions of formal vocabulary.”
What ontology has in common in both computer science and philosophy is the representation of entities, ideas, and events, along with their properties and relations, according to a system of categories. In both fields, one finds considerable work on problems of ontological relativity (e.g. [[Quine]] and [[Kripke]] in philosophy, [[John F. Sowa|Sowa]] and [[Nicola Guarino|Guarino]] in computer science; see Sowa, John F., "Top-level ontological categories", ''International Journal of Human-Computer Studies'', v. 43 (November/December 1995), p. 669-85), and debates concerning whether a normative ontology is viable (e.g. debates over [[foundationalism]] in philosophy, debates over the [[Cyc]] project in AI).
Differences between the two are largely matters of focus. Philosophers are less concerned with establishing fixed, controlled vocabularies than are researchers in computer science, while computer scientists are less involved in discussions of first principles (such as debating whether there are such things as fixed essences, or whether entities must be ontologically more primary than processes). During the second half of the 20th century, philosophers extensively debated the possible methods or approaches to building ontologies, without actually ''building'' any very elaborate ontologies themselves. By contrast, computer scientists were building some large and robust ontologies (such as [[WordNet]] and [[Cyc]]) with comparatively little debate over ''how'' they were built.
In the early years of the 21st century, the interdisciplinary project of [[cognitive science]] has been bringing the two circles of scholars closer together. For example, there is talk of a "computational turn in philosophy" which includes philosophers analyzing the formal ontologies of computer science (sometimes even working directly with the software), while researchers in computer science have been making more references to those philosophers who work on ontology (sometimes with direct consequences for their methods). Still, many scholars in both fields are uninvolved in this trend of cognitive science, and continue to work independently of one another, pursuing separately their different concerns.
==Resources==
===Examples of published ontologies ===
* [[Dublin Core]], a simple ontology for documents and publishing.
* [[Cyc]] for formal representation of the universe of discourse.
* [[Suggested Upper Merged Ontology]], which is a formal upper ontology
* [http://www.ifomis.org/bfo/ Basic Formal Ontology (BFO)], a formal upper ontology designed to support scientific research
* [[Gellish English dictionary]], an ontology comprising a dictionary and a taxonomy, with an upper ontology and a lower ontology that focuses on industrial and business applications in engineering, technology and procurement.
* [http://www.fb10.uni-bremen.de/anglistik/langpro/webspace/jb/gum/index.htm Generalized Upper Model], a linguistically-motivated ontology for mediating between client systems and natural language technology
* [[WordNet]] Lexical reference system
* [[OBO Foundry]]: a suite of interoperable reference ontologies in biomedicine.
* The [[Ontology for Biomedical Investigations]] is an open access, integrated ontology for the description of biological and clinical investigations.
* [http://colab.cim3.net/file/work/SICoP/ontac/COSMO/ COSMO]: An OWL ontology that is a merger of the basic elements of the OpenCyc and SUMO ontologies, with additional elements.
* [[Gene Ontology]] for [[genomics]]
* [http://pir.georgetown.edu/pro/ PRO], the Protein Ontology of the Protein Information Resource, Georgetown University.
* [http://proteinontology.info/ Protein Ontology] for [[proteomics]]
* [http://sig.biostr.washington.edu/projects/fm/AboutFM.html Foundational Model of Anatomy] for human anatomy
* [[SBO]], the Systems Biology Ontology, for computational models in biology
* [http://www.plantontology.org/ Plant Ontology] for plant structures and growth/development stages, etc.
* [[CIDOC|CIDOC CRM]] (Conceptual Reference Model) - an ontology for "[[cultural heritage]] information".
* [http://www.linguistics-ontology.org/gold.html GOLD] ('''G'''eneral '''O'''ntology for [[descriptive linguistics|'''L'''inguistic '''D'''escription]])
* [http://www.landcglobal.com/pages/linkbase.php Linkbase] A formal representation of the biomedical domain, founded upon [http://www.ifomis.org/bfo/ Basic Formal Ontology (BFO)].
* [http://www.loa-cnr.it/Ontologies.html Foundational, Core and Linguistic Ontologies]
* [[ThoughtTreasure]] ontology
* [[LPL]] Lawson Pattern Language
* [[TIME-ITEM]] Topics for Indexing Medical Education
* [[POPE]] Purdue Ontology for Pharmaceutical Engineering
* [[IDEAS Group]] A formal ontology for enterprise architecture being developed by the Australian, Canadian, UK and U.S. Defence Depts. [http://www.ideasgroup.org The IDEAS Group Website]
* [http://www.eden-study.org/articles/2007/problems-ontology-programs_ao.pdf program abstraction taxonomy]
* [http://sweet.jpl.nasa.gov/ SWEET] Semantic Web for Earth and Environmental Terminology
* [http://www.cellcycleontology.org/ CCO] The Cell-Cycle Ontology is an application ontology that represents the cell cycle
===Ontology libraries===
The development of ontologies for the Web has led to the emergence of services providing lists or directories of ontologies with search facilities. Such directories have been called ontology libraries.
The following are static libraries of human-selected ontologies.
* The [http://www.daml.org/ontologies/ DAML Ontology Library] maintains a legacy of ontologies in DAML.
* The [http://protegewiki.stanford.edu/index.php/Protege_Ontology_Library Protege Ontology Library] contains a set of OWL, frame-based and other format ontologies.
* [http://www.schemaweb.info/ SchemaWeb] is a directory of RDF schemata expressed in RDFS, OWL and DAML+OIL.
The following are both directories and search engines. They include crawlers searching the Web for well-formed ontologies.
* [[Swoogle]] is a directory and search engine for all RDF resources available on the Web, including ontologies.
* The [http://olp.dfki.de/OntoSelect/ OntoSelect] Ontology Library offers similar services for RDF/S, DAML and OWL ontologies.
* [http://www.w3.org/2004/ontaria/ Ontaria] is a "searchable and browsable directory of semantic web data", with a focus on RDF vocabularies with OWL ontologies.
* The [http://www.obofoundry.org/ OBO Foundry / Bioportal] is a suite of interoperable reference ontologies in biology and biomedicine.
OpenOffice.org
'''OpenOffice.org''' ('''OO.o''' or '''OOo''') is a [[cross-platform]] [[office suite|office application suite]] available for a number of different computer [[operating system]]s. It supports the ISO standard '''[[OpenDocument]] Format (ODF)''' for data interchange as its default [[file format]], as well as the [[Microsoft Office]] 97–2003 formats and the Microsoft Office 2007 format (in version 3), among many others.
OpenOffice.org was originally derived from [[StarOffice]], an office suite developed by [[StarDivision]] and acquired by [[Sun Microsystems]] in August 1999. The [[source code]] of the suite was released in July 2000 with the aim of reducing the dominant [[market share]] of [[Microsoft Office]] by providing a free, open and high-quality alternative; later versions of StarOffice are based upon OpenOffice.org with additional proprietary components. OpenOffice.org is [[free software]], available under the [[GNU Lesser General Public License]] (LGPL).
The project and software are informally referred to as ''OpenOffice'', but this term is a [[trademark]] held by another party, requiring the project to adopt ''OpenOffice.org'' as its formal name.
== History==
The suite was originally developed as the [[proprietary software]] application suite StarOffice by the German company [[StarDivision]]; its code was purchased by Sun Microsystems in 1999. In August 1999, version 5.2 of StarOffice was made available free of charge.
On [[July 19]], [[2000]], Sun Microsystems announced that it was making the source code of StarOffice available for download under both the LGPL and the [[Sun Industry Standards Source License]] (SISSL) with the intention of building an open source development community around the software. The new project was known as OpenOffice.org, and its website went live on [[October 13]], [[2000]].
Work on version 2.0 began in early 2003 with the following goals: better interoperability with Microsoft Office; better performance, with improved speed and lower memory usage; greater [[Scripting language|scripting]] capabilities; better integration, particularly with [[GNOME]]; an easier-to-find and use database front-end for creating reports, forms and queries; a new built-in [[SQL]] database; and improved [[usability]]. A [[beta version]] was released on [[March 4]], [[2005]].
On [[September 2]], [[2005]] Sun announced that it was retiring the SISSL. As a consequence, the OpenOffice.org Community Council announced that it would no longer [[dual license]] the office suite, and future versions would use only the LGPL.
On [[October 20]], [[2005]], OpenOffice.org 2.0 was formally released to the public. Eight weeks after the release of Version 2.0, an update, OpenOffice.org 2.0.1, was released. It fixed minor bugs and introduced new features.
As of the 2.0.3 release, OpenOffice.org changed its release cycle from 18 months to releasing updates, feature enhancements and bug fixes every three months. Currently, versions that include new features (so-called "feature releases") are released every six months, alternating with so-called "bug fix releases" issued between two feature releases, so that a release of one kind or the other appears every three months.
=== StarOffice ===
Sun subsidizes the development of OpenOffice.org in order to use it as a base for its commercial [[proprietary software|proprietary]] StarOffice application software. Releases of StarOffice since version 6.0 have been based on the OpenOffice.org source code, with some additional proprietary components, including:
* Additional bundled fonts (especially [[CJK|East Asian language]] fonts).
* [[Adabas D]] database.
* Additional document [[Template (word processing)|templates]].
* [[Clip art]].
* Sorting functionality for Asian versions.
* Additional file filters.
* Migration assessment tool (Enterprise Edition).
* Macro migration tool (Enterprise Edition).
* Configuration management tool (Enterprise Edition).
OpenOffice.org therefore inherited many features from the original StarOffice upon which it was based, including the [[OpenOffice.org XML]] file format, which it retained until version 2, when it was replaced by the ISO standard [[OpenDocument]] Format (ODF).
== Features ==
According to its [[mission statement]], the OpenOffice.org project aims "''To create, as a community, the leading international office suite that will run on all major platforms and provide access to all functionality and data through open-component based APIs and an XML-based file format.''"
OpenOffice.org aims to compete with Microsoft Office and emulate its look and feel where suitable. It can read and write most of the [[file formats]] found in Microsoft Office and many other applications, an essential feature of the suite for many users. OpenOffice.org has been found to be able to open files of older versions of Microsoft Office and damaged files that newer versions of Microsoft Office itself cannot open. However, it cannot open older Word for Macintosh (MCW) files.
=== Platforms ===
Platforms for which OO.o is available include [[Microsoft Windows]], [[Linux]], [[Solaris Operating System|Solaris]], [[BSD]], [[OpenVMS]], [[OS/2]] and [[IRIX]]. The current primary development platforms are Microsoft Windows, Linux and Solaris.
A port for [[Mac OS X]] exists for OS X machines which have the [[X Window System]] component installed. A port to OS X's native [[Aqua (user interface)|Aqua user interface]] is in progress, and is scheduled for completion for the 3.0 milestone. [[NeoOffice]] is an independent [[Fork (software development)|fork]] of OpenOffice, specially adapted for Mac OS X.
=== Version compatibility ===
*Windows 95: up to v1.1.5
*Windows 98-Vista: up to v2.4, development releases of v3.0
*Mac OS 10.2: up to v1.1.2
*Mac OS 10.3: up to v2.1
*Mac OS 10.4-10.5: up to v2.4, development releases of v3.0 ([[Apple-Intel architecture|Intel]] only)
*OS/2 and eComStation: up to v2.0.4
=== Components ===
OpenOffice.org is a collection of applications that work together closely to provide the features expected from a modern office suite. Many of the components are designed to mirror those available in Microsoft Office. The components available include:
*[[QuickStart]]er
:A small program for Windows and Linux that runs when the computer starts. It loads the core files and libraries for OpenOffice.org during computer startup and allows the suite applications to start more quickly when selected later. The amount of time it takes to open OpenOffice.org applications was a common complaint in version 1.0 of the suite. Substantial improvements were made in this area for version 2.2.
*The [[Macro (computer science)|macro]] recorder
:Records user actions and replays them later to help with automating tasks, using [[OpenOffice.org Basic]] (see [[OpenOffice.org#OpenOffice.org Basic|below]]).
It is not possible to download these components individually on Windows, though they can be installed separately. Most Linux distributions break the components into individual packages which may be downloaded and installed separately.
=== OpenOffice.org Basic ===
OpenOffice.org Basic is a programming language similar to Microsoft [[Visual Basic for Applications]] (VBA) based on [[StarOffice Basic]]. In addition to the macros, the upcoming Novell edition of OpenOffice.org 2.0 supports running Microsoft VBA macros, a feature expected to be incorporated into the mainstream version soon.
OpenOffice.org Basic is available in the Writer and Calc applications. It is written in functions called subroutines or macros, with each macro performing a different task, such as counting the words in a paragraph. OpenOffice.org Basic is especially useful in doing repetitive tasks that have not been integrated in the program.
As the OpenOffice.org database, called "Base", uses documents created under the Writer application for reports and forms, one could say that Base can also be programmed with OpenOffice.org Basic.
== File formats ==
OpenOffice.org pioneered the ISO/IEC standard [[OpenDocument]] file formats (ODF), which it uses natively, by default. It also supports reading (and in some cases writing) a large number of legacy proprietary file formats (e.g.: [[WordPerfect]] through libwpd, [[StarOffice]], [[Lotus software]], [[Microsoft Works|MS Works]] through libwps, [[Rich Text Format]]), most notably including [[Microsoft Office]] formats.
The OpenDocument specification was "approved for release as an ISO and IEC International Standard" under the name ISO/IEC 26300:2006.
=== Microsoft Office interoperability ===
In response to Microsoft's recent movement towards using the [[Office Open XML]] format in [[Microsoft Office 2007]], [[Novell]] has released an Office Open XML converter for OOo under a liberal [[BSD license]] (along with [[GNU GPL]] and [[LGPL]] licensed libraries), which will be submitted for inclusion into the OpenOffice.org project. This allows OOo to read and write Office Open XML word processing documents (.docx). Currently it works only with the latest Novell edition of OpenOffice.org. [[Sun Microsystems]] has developed an ODF plugin for Microsoft Office which enables users of Microsoft Office Word, Excel and PowerPoint to read and write ODF documents. The plugin currently works with Microsoft Office 2003, Microsoft Office XP and Microsoft Office 2000. Support for Microsoft Office 2007 is only available in combination with Microsoft Office 2007 SP1.
Several software companies (including Microsoft and Novell) are working on an add-in for Microsoft Office that allows reading and writing ODF files. Currently it works only for Microsoft Word 2007 / XP / 2003.
Microsoft provides a compatibility pack to read and write Office Open XML files with Office 2000, XP and 2003. The compatibility pack can also be used as a stand-alone converter with Microsoft Office 97. This might be helpful for converting older Microsoft Office files via Office Open XML to ODF if a direct conversion does not work as expected. The compatibility pack, however, does not install for Office 2000 or Office XP on [[Windows 9x]].
Note that some office applications built with Microsoft components may refuse to import OpenOffice data. [[The Sage Group]]'s Simply Accounting, for example, can import Excel's .xls files, but refuses to accept OpenOffice.org-generated .xls files for the reason that the OOo .xls files are not "genuine Microsoft" .xls files.
== Development ==
=== Overview ===
The OpenOffice.org [[Application Programming Interface|API]] is based on a component technology known as [[Universal Network Objects]] (UNO). It consists of a wide range of interfaces defined in a [[CORBA]]-like [[interface description language]].
The [[document file format]] used is based on [[XML]] and several export and import filters. All external formats read by OpenOffice.org are converted back and forth from an internal XML representation. By using [[data compression|compression]] when saving [[XML]] to disk, files are generally smaller than the equivalent binary Microsoft Office documents. The native file format for storing documents in version 1.0 was used as the basis of the [[OASIS (organization)|OASIS]] OpenDocument file format standard, which has become the default file format in version 2.0.
Development versions of the suite are released every few weeks on the developer zone of the OpenOffice.org website. The releases are meant for those who wish to test new features or are simply curious about forthcoming changes; they are not suitable for production use.
=== Native desktop integration ===
OpenOffice.org 1.0 was criticized for not having the [[look and feel]] of applications developed natively for the platforms on which it runs. Starting with version 2.0, OpenOffice.org uses native [[widget toolkit]]s, icons, and font-rendering libraries across a variety of platforms, to better match native applications and provide a smoother experience for the user. There are projects underway to further improve this integration on both [[GNOME]] and [[KDE]].
This issue has been particularly pronounced on Mac OS X, whose standard user interface looks noticeably different from either Windows or [[X11]]-based desktop environments and requires the use of programming toolkits unfamiliar to most OpenOffice.org developers. There are two implementations of OpenOffice.org available for OS X:
;OpenOffice.org Mac OS X (X11): This official implementation requires the installation of [[X11.app]] or [[XDarwin]], and is a close port of the well-tested Unix version. It is functionally equivalent to the Unix version, and its user interface resembles the [[look and feel]] of that version; for example, the application uses its own [[menu bar]] instead of the OS X menu at the top of the screen. It also requires system fonts to be converted to X11 format for OpenOffice.org to use them (which can be done during application installation).
;OpenOffice.org Aqua: After a completed first step using [[Carbon (API)|Carbon]], the OpenOffice.org Aqua port switched to [[Cocoa (API)|Cocoa]]. This Cocoa-based [[Aqua (GUI)|Aqua]] version is being developed under the aegis of OpenOffice.org, with a beta version currently available. Sun Microsystems is collaborating with OOo to further the development of the Aqua version of OpenOffice.org for Mac.
=== Future ===
Currently, a developer preview of OpenOffice.org 3 (OOo-dev 3.0) is available for download.
Among the planned features for OOo 3.0, set to be released by September 2008, are:
* Personal Information Manager ([[Personal Information Manager|PIM]]), probably based on [[Mozilla Thunderbird|Thunderbird]]/[[Lightning (software)|Lightning]]
* PDF import into Draw (to maintain correct layout of the original PDF)
* [[OOXML]] document support for opening documents created in [[Office 2007]]
* Support for [[Mac OS X]] [[Aqua (user interface)|Aqua]] platform
* Extensions, to add third party functionality.
* Presenter screen in Impress with multi-screen support
=== Other projects ===
A number of products are [http://wiki.services.openoffice.org/wiki/DerivedWorks derived from OpenOffice.org]. Among the more well-known ones are Sun StarOffice and NeoOffice. The OpenOffice.org site also lists a large variety of [http://wiki.services.openoffice.org/wiki/OpenOffice.org_Solutions complementary products] including groupware solutions.
==== NeoOffice ====
[[NeoOffice]] is an independent [[porting|port]] that integrates with [[Mac OS X|OS X]]’s [[Aqua (GUI)|Aqua]] user interface using [[Java platform|Java]], [[Carbon (API)|Carbon]] and (increasingly) [[Cocoa (API)|Cocoa]] toolkits. NeoOffice adheres fairly closely to OS X UI standards (for example, using native pull-down menus), and has direct access to OS X’s installed fonts and printers. Its releases lag behind the official OpenOffice.org X11 releases, due to its small development team and the concurrent development of the technology used to port the user interface.
Other projects run alongside the main OpenOffice.org project and are easier to contribute to. These include documentation, [[internationalisation and localisation]] and the API.
==== OpenGroupware.org ====
[[OpenGroupware.org]] is a set of extension programs to allow the sharing of OpenOffice.org documents, calendars, address books, [[e-mail]]s, [[instant messenger|instant messaging]] and blackboards, and provide access to other [[collaborative software|groupware]] applications.
There is also an effort to create and share assorted document templates and other useful additions at OOExtras.
A set of [[Perl]] extensions is available through the [[CPAN]] in order to allow OpenOffice.org document processing by external programs. These libraries do not use the OpenOffice.org API. They directly read or write the OpenOffice.org files using Perl standard file [[codec|compression/decompression]], XML access and [[UTF-8]] encoding modules.
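The same direct-access approach can be sketched in other languages. The following is a minimal illustration in Python rather than Perl, and it is not the API of the CPAN modules mentioned above. It assumes an OpenDocument-style package, i.e. a ZIP archive whose body text is stored in a member named <code>content.xml</code>. The helper names <code>extract_paragraphs</code> and <code>make_sample_package</code> are hypothetical, and the sample package is a stripped-down stand-in for a real document, built only so that the example is self-contained.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace URIs used by OpenDocument content files.
OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def extract_paragraphs(package_bytes):
    """Return the text of every text:p element found in content.xml."""
    # Step 1: decompression. The package is an ordinary ZIP archive.
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        xml_data = package.read("content.xml")
    # Step 2: XML access. Parse the UTF-8 encoded document body.
    root = ET.fromstring(xml_data)
    return ["".join(p.itertext()) for p in root.iter(f"{{{TEXT_NS}}}p")]


def make_sample_package():
    """Build a tiny stand-in package (illustration only, not a full document)."""
    content = (
        f'<office:document-content xmlns:office="{OFFICE_NS}" '
        f'xmlns:text="{TEXT_NS}">'
        "<office:body><office:text>"
        "<text:p>Hello</text:p><text:p>World</text:p>"
        "</office:text></office:body></office:document-content>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        package.writestr("content.xml", content.encode("utf-8"))
    return buffer.getvalue()


if __name__ == "__main__":
    print(extract_paragraphs(make_sample_package()))
```

Because nothing here calls into the office suite, a script of this kind can run on a machine where OpenOffice.org is not installed, which is the main attraction of the file-level approach.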
==== Portable ====
A distribution of OpenOffice.org called OpenOffice.org Portable is designed to run the suite from a [[USB flash drive]].
==== OxygenOffice Professional ====
Current version: 2.4
An enhancement of OpenOffice.org, providing:
* Possibility to run Visual Basic for Applications (VBA) macros in Calc (for testing)
* Improved Calc HTML export
* Enhanced Access support for Base
* Security fixes
* Enhanced performance
* Enhanced color-palette
* Enhanced help menu, additional User’s Manual, and extended tips for beginners
Optionally it provides, free for personal and professional use:
* More than 3,200 graphics, both clip art and photos.
* Several templates and sample documents
* Over 90 free fonts.
* Additional tools like OOoWikipedia
====Extensions====
Since version 2.0.4, OpenOffice.org has supported extensions in a similar manner to [[Mozilla Firefox]]. Extensions make it easy to add new functionality to an existing OpenOffice.org installation. The [http://extensions.services.openoffice.org/most_pop_ext OpenOffice.org Extension Repository] already lists more than 80 extensions. Developers can build new extensions for OpenOffice.org, for example by using the [http://wiki.services.openoffice.org/wiki/OpenOffice_NetBeans_Integration OpenOffice.org API Plugin for NetBeans].
==== The OpenOffice.org Bibliographic Project ====
This project aims to incorporate powerful [[reference management software]] into the suite. The new major addition is slated for inclusion in the standard OpenOffice.org release between late 2007 and mid-2008, or possibly later depending upon the availability of programmers.
=== Security ===
OpenOffice.org includes a security team, and as of June 2008 the security organization [[Secunia]] reports no known unpatched security flaws for the software. [[Kaspersky Lab]] has shown a [[proof of concept]] virus for OpenOffice.org. This shows OOo viruses are possible, but there is no known virus "in the wild".
In a private meeting of the French Ministry of Defense, macro-related security issues were raised. OpenOffice.org developers have responded and noted that the supposed vulnerability had not been announced through "well defined procedures" for disclosure and that the ministry had revealed nothing specific. However, the developers have been in talks with the researcher concerning the supposed vulnerability.
As with Microsoft Word, documents created in OpenOffice can contain [[metadata]] which may include a complete history of what was changed, when and by whom.
== Ownership ==
The project and software are informally referred to as ''OpenOffice'', but project organizers report that this term is a [[trademark]] held by another party, requiring them to adopt ''OpenOffice.org'' as its formal name. (Due to a similar trademark issue, the [[Brazilian Portuguese]] version of the suite is distributed under the name ''BrOffice.org''.)
Development is managed by staff members of StarOffice. Some delay and difficulty in implementing external contributions to the core codebase (even those from the project's corporate sponsors) has been noted.
Currently, there are [http://wiki.services.openoffice.org/wiki/DerivedWorks several derived and/or proprietary works based on OOo], with some of them being:
* Sun Microsystems' [[StarOffice]], with various complementary add-ons.
* IBM's [[Lotus Symphony]], with a new interface based on [[Eclipse (software)|Eclipse]] (based on OO.o 1.x).
* OpenOffice.org Novell edition, integrated with [[Novell Evolution|Evolution]] and with a [[OOXML]] filter.
* Beijing [[Redflag]] Chinese 2000's [[RedOffice]], fully localized in Chinese characters.
* Planamesa's [[NeoOffice]] for [[Mac OS X]] with Aqua support via Java.
On [[May 23]], [[2007]], the OpenOffice.org community and Redflag Chinese 2000 Software Co, Ltd. announced a joint development effort focused on integrating the new features that have been added in the RedOffice localization of OpenOffice.org, as well as quality assurance and work on the core applications. Additionally, Redflag Chinese 2000 made public its commitment to the global OO.o community, stating it would "strengthen its support of the development of the world's leading free and open source productivity suite", adding around 50 engineers (who have been working on RedOffice since 2006) to the project.
On [[September 10]], [[2007]], the OO.o community announced that [[IBM]] had joined to support the development of OpenOffice.org. "IBM will be making initial code contributions that it has been developing as part of its Lotus Notes product, including accessibility enhancements, and will be making ongoing contributions to the feature richness and code quality of OpenOffice.org. Besides working with the community on the free productivity suite's software, IBM will also leverage OpenOffice.org technology in its products" as has been seen with [[Lotus Symphony]]. Sean Poulley, the vice president of business and strategy in IBM's [[Lotus Software]] division, said that IBM plans to take a leadership role in the OpenOffice.org community together with other companies such as Sun Microsystems. IBM will work within the leadership structure that exists.
On [[October 02]], [[2007]], [[Michael Meeks]] announced a derived OpenOffice.org work (prompting a response from Sun's [[Simon Phipps]] and Mathias Bauer), under the wing of his employer [[Novell]], with the purpose of including new features and fixes that do not get easily integrated in the OOo-build up-stream core. The work is called Go-OO (http://go-oo.org/), a name under which alternative OO.o software has been available for five years. The new features are shared with Novell's edition of OOo and include:
* [[Visual Basic for Applications|VBA]] macros support.
* Faster start up time.
* "A [[Linear programming|linear optimization]] solver to optimize a cell value based on arbitrary constraints built into Calc".
* Support for multimedia content in documents, using the [[gstreamer]] multimedia framework.
* Support for [[Microsoft Works]] formats, [[WordPerfect]] graphics (WPG format) and T602 file imports.
[http://wiki.services.openoffice.org/wiki/Contributing_Patches Details about the patch handling including metrics] can be found on the OpenOffice.org site.
== Reactions ==
''Federal Computer Week'' listed OpenOffice.org as one of the "5 stars of open-source products." In contrast, OpenOffice.org was used in [[2005]] by ''[[The Guardian]]'' newspaper to illustrate what it claimed were the limitations of open-source software, although the article does finish by stating that the software may be better than MS Word for books.
=== Market share ===
It is extremely difficult to estimate the market share of OpenOffice.org because it can be freely distributed via download sites including mirrors, peer-to-peer networks, CDs, Linux distros, etc. Nevertheless, the OpenOffice.org project tries to capture key adoption data in a market share analysis.
Although Microsoft Office retains 95% of the general market as measured by revenue, OpenOffice.org and StarOffice have secured 14% of the large enterprise market as of 2004 and 19% of the small to midsize business market in 2005. The OpenOffice.org web site reports more than 98 million downloads.
Other large-scale users of OpenOffice.org include [[Ministry of Defence (Singapore)|Singapore’s Ministry of Defence]], and [[Bristol]] City Council in the UK. In [[France]], OpenOffice.org has attracted the attention of both local and national government administrations who wish to rationalize their software procurement, as well as have stable, standard file formats for archival purposes. It is now the official office suite for the [[French Gendarmerie]]. Several government organizations in India that use Linux, such as [[IIT Bombay]] (a renowned technical institute), the [[Supreme Court of India]] and the [[Allahabad High Court]], rely completely on OpenOffice.org for their administration.
On [[October 4]], [[2005]], Sun and [[Google]] announced a strategic partnership. As part of this agreement, Sun will add a Google search bar to OpenOffice.org, Sun and Google will engage in joint marketing activities as well as joint research and development, and Google will help distribute OpenOffice.org. Google is currently distributing StarOffice as part of the [[Google Pack]].
Besides StarOffice, there are still a number of commercial products derived from OpenOffice.org. Most of them are developed under the [[SISSL]] license (which is valid up to OpenOffice.org 2.0 Beta 2). In general they are targeted at local or niche markets, with proprietary add-ons such as a speech recognition module, automatic database connection, or better [[CJK]] support.
In July 2007 Everex, a division of First International Computer and the 9th largest PC supplier in the U.S., began shipping systems preloaded with OpenOffice.org 2.2 into Wal-Mart and Sam's Club throughout North America.
In September 2007 IBM announced that it would supply and support OpenOffice.org branded as [[Lotus Symphony]], and integrated into Lotus Notes. IBM also announced 35 developers would be assigned to work on OpenOffice.org, and that it would join the OpenOffice.org foundation. Commentators noted parallels between IBM's 2000 support of Linux and this announcement.
=== Java controversy ===
In the past OpenOffice.org was criticized for an increasing dependency on the [[Java Runtime Environment]] which was not [[free software]]. That Sun Microsystems is both the creator of Java and the chief supporter of OpenOffice.org drew accusations of ulterior motives for this technology choice.
Version 1 depended on the [[Java Runtime Environment]] (JRE) being present on the user’s computer for some auxiliary functions, but version 2 increased the suite’s use of Java requiring a JRE. In response, [[Red Hat]] increased their efforts to improve [[free Java implementations]]. Red Hat’s [[Fedora (Linux distribution)|Fedora Core]] 4 (released on [[June 13]], [[2005]]) included a beta version of OpenOffice.org version 2, running on [[GNU Compiler for Java|GCJ]] and [[GNU Classpath]].
The issue of OpenOffice.org’s use of Java came to the fore in May 2005, when [[Richard Stallman]] appeared to call for a [[fork (software)|fork]] of the application in a posting on the [[Free Software Foundation]] website. This led to discussions within the OpenOffice.org community and between Sun staff and developers involved in [[GNU Classpath]], a free replacement for Sun’s Java implementation. Later that year, the OpenOffice.org developers also placed into their development guidelines various requirements to ensure that future versions of OpenOffice.org could be run on free implementations of Java and fixed the issues which previously prevented OpenOffice.org 2.0 from using free software Java implementations.
On [[November 13]], [[2006]], Sun committed to releasing Java under the [[GNU General Public License]] in the near future. This process would end OpenOffice.org's dependence on [[non-free]] software.
Between November 2006 and May 2007, Sun Microsystems made available most of their Java technologies under the GNU General Public License, in compliance with the specifications of the Java Community Process, thus making almost all of Sun's Java also free software.
The following areas of OpenOffice.org 2.0 depend on the JRE being present:
* The [[media player (application software)|media player]] on Unix-like systems
* All document wizards in Writer
* Accessibility tools
* Report Autopilot
* [[JDBC]] driver support
* [[Hsqldb|HSQL]] database engine, which is used in OpenOffice.org Base
* [[XSLT]] filters
* [[BeanShell]], the [[NetBeans]] scripting language and the Java UNO bridge
* Export filters to the Aportis.doc (.pdb) format for the [[Palm OS]] or [[Pocket Word]] (.psw) format for the [[Pocket PC]]
* Export filter to [[LaTeX]]
* Export filter to [[MediaWiki]]'s [[wikitext]]
A common point of confusion is that [[mail merge]] to generate emails requires the Java API JavaMail in [[StarOffice]]; however, as of version 2.0.1, OpenOffice.org uses a [[Python (programming language)|Python]]-component instead.
=== Complementary software ===
OpenOffice.org provides replacements for MS Office's [[Microsoft Word]], [[Microsoft Excel]], [[Microsoft PowerPoint]], [[Microsoft Access]], [[Equation Editor|Microsoft Equation Editor]] and [[Microsoft Visio]]. To match the functionality of the rest of MS Office, OOo can be complemented with other open source programs such as:
* [[Novell Evolution|Evolution]] or [[Mozilla Thunderbird|Thunderbird]]/[[Lightning (software)|Lightning]] for a PIM like [[Microsoft Outlook]].
* [[OpenProj]] (which seeks integration with OOo, but might be limited due to licensing issues) for [[Microsoft Project]].
* [[Scribus]] for [[Microsoft Publisher]]
* [[O3spaces]] for [[Sharepoint]]
Microsoft also provides Administrative Template Files ("adm files") that allow MS Office to be configured using Windows Group Policy. Equivalent functionality for OpenOffice.org is provided by [http://openoffice-enterprise.com/ OpenOffice-Enterprise], a commercial product from Open Office Technology, Inc.
=== Issues ===
OpenOffice.org has been criticized for slow start times and extensive CPU and RAM usage in comparison with competing software such as Microsoft Office. Tests comparing OpenOffice.org 2.2 with Microsoft Office 2007 found that OpenOffice.org took approximately 2 times the processing time and memory to load itself along with a blank file, and approximately 4.7 times the processing time and 3.9 times the memory to open an extremely large spreadsheet file. Critics have pointed to excessive code bloat and OpenOffice.org's loading of the [[Java Virtual Machine|Java Runtime Environment]] as possible reasons for the slow speeds and excessive memory usage.
However, since OpenOffice.org 2.2 the performance of OpenOffice.org has been improved dramatically.
One of the greatest challenges is achieving true cross-compatibility with other applications. Since OpenOffice.org is forced to reverse engineer proprietary binary formats due to the unavailability of open specifications, slight formatting incompatibilities tend to exist when files are saved in a non-native format. For example, a complex .doc document formatted under OpenOffice.org is usually not displayed with the correct format when opened with Microsoft Office.
== Retail ==
The [[free software license]] under which OpenOffice.org is distributed allows unlimited use of the software for both home and business use, including unlimited redistribution of the software. Several businesses sell the OpenOffice.org suite on auction websites such as [[eBay]], offering value-added services such as 24/7 technical support, download mirrors, and CD mailing. However, often the 24/7 support offered is not provided by the company selling the software, but rather by the official OpenOffice.org mailing list.
Parsing
In [[computer science]] and [[linguistics]], '''parsing''', or, more formally, '''syntactic analysis''', is the process of analyzing a sequence of [[Token (parser)|tokens]] to determine grammatical structure with respect to a given (more or less) [[formal grammar]]. A '''parser''' is thus one of the components in an [[interpreter]] or [[compiler]], where it captures the implied hierarchy of the input text and transforms it into a form suitable for further processing (often some kind of [[parse tree]], [[abstract syntax tree]] or other hierarchical structure) and normally checks for syntax errors at the same time. The parser often uses a separate [[lexical analyser]] to create tokens from the sequence of input characters. Parsers may be programmed by hand or may be semi-automatically generated (in some programming language) by a tool (such as [[Yet Another Compiler Compiler|Yacc]]) from a grammar written in [[Backus-Naur form]].
Parsing is also an earlier term for the diagramming of sentences of natural languages, and is still used for the diagramming of [[Inflection|inflected]] languages, such as the [[Romance languages|Romance languages]] or [[Latin]].
Parsers can also be constructed as executable specifications of grammars in functional programming languages. Frost, Hafiz and Callaghan have built on the work of others to construct a set of [[higher-order function]]s (called [[parser combinators]]) which allow top-down parsers with polynomial time and space complexity to be constructed as executable specifications of ambiguous grammars containing left-recursive productions. The [http://www.cs.uwindsor.ca/~hafiz/proHome.html X-SAIGA] site has more about the algorithms and implementation details.
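The general idea of parser combinators can be illustrated with a deliberately naive sketch in Python. It is not the Frost, Hafiz and Callaghan algorithm: it has no memoization, would loop forever on left-recursive rules, and makes no polynomial-time guarantee. The combinator names <code>token</code>, <code>seq</code> and <code>alt</code> are assumptions chosen for this illustration. Each parser is a function from an input string and a position to a list of (value, next position) pairs, so an ambiguous grammar simply yields more than one result.

```python
def token(literal):
    """Parser that matches one fixed string at the current position."""
    def parse(text, pos):
        if text.startswith(literal, pos):
            return [(literal, pos + len(literal))]
        return []
    return parse


def seq(*parsers):
    """Run the parsers one after another, keeping every combination."""
    def parse(text, pos):
        partial = [([], pos)]
        for parser in parsers:
            partial = [
                (values + [value], nxt)
                for values, cur in partial
                for value, nxt in parser(text, cur)
            ]
        return [(tuple(values), end) for values, end in partial]
    return parse


def alt(*parsers):
    """Inclusive choice: collect the results of all alternatives."""
    def parse(text, pos):
        return [result for parser in parsers for result in parser(text, pos)]
    return parse


# Ambiguous toy grammar:  A ::= "a" | "aa"    S ::= A A
A = alt(token("a"), token("aa"))
S = seq(A, A)


def complete_parses(parser, text):
    """Keep only the results that consume the whole input."""
    return [value for value, end in parser(text, 0) if end == len(text)]


if __name__ == "__main__":
    print(complete_parses(S, "aaa"))
```

The grammar definition on the last lines reads almost like the grammar itself, which is what is meant by an executable specification.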
== Human languages ==
:''Also see [[:Category:Natural language parsing]]''
In some [[machine translation]] and [[natural language processing]] systems, human languages are parsed by computer programs. Human sentences are not easily parsed by programs, as there is substantial [[syntactic ambiguity|ambiguity]] in the structure of human language. In order to parse natural language data, researchers must first agree on the [[grammar]] to be used. The choice of syntax is affected by both [[linguistic]] and computational concerns; for instance some parsing systems use [[lexical functional grammar]], but in general, parsing for grammars of this type is known to be [[NP-complete]]. [[Head-driven phrase structure grammar]] is another linguistic formalism which has been popular in the parsing community, but other research efforts have focused on less complex formalisms such as the one used in the Penn [[Treebank]]. [[Shallow parsing]] aims to find only the boundaries of major constituents such as noun phrases. Another popular strategy for avoiding linguistic controversy is [[dependency grammar]] parsing.
Most modern parsers are at least partly [[statistics|statistical]]; that is, they rely on a corpus of training data which has already been annotated (parsed by hand). This approach allows the system to gather information about the frequency with which various constructions occur in specific contexts. ''(See [[machine learning]].)'' Approaches which have been used include straightforward [[PCFG]]s (probabilistic context free grammars), [[maximum entropy]], and [[neural net]]s. Most of the more successful systems use ''lexical'' statistics (that is, they consider the identities of the words involved, as well as their [[part of speech]]). However such systems are vulnerable to [[overfitting]] and require some kind of smoothing to be effective.
Parsing algorithms for natural language cannot rely on the grammar having 'nice' properties as with manually-designed grammars for programming languages. As mentioned earlier, some grammar formalisms are very computationally difficult to parse; in general, even if the desired structure is not [[context-free]], some kind of context-free approximation to the grammar is used to perform a first pass. Algorithms which use context-free grammars often rely on some variant of the [[CKY algorithm]], usually with some [[heuristic (computer science)|heuristic]] to prune away unlikely analyses to save time. ''(See [[chart parsing]].)'' However, some systems trade accuracy for speed using, e.g., linear-time versions of the [[Shift-reduce parsing|shift-reduce]] algorithm. A somewhat recent development has been [[parse reranking]], in which the parser proposes some large number of analyses, and a more complex system selects the best option.
The resulting analysis normally has a branching structure, in which each part is divided into its subparts.
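The chart-filling idea behind CKY-style algorithms (the name is also written CYK) can be sketched as a plain recognizer in Python. This is a minimal illustration under stated assumptions: the grammar must already be in Chomsky normal form, there are no probabilities and no pruning heuristic, and the toy grammar, the rule encoding and the function name <code>cky_recognize</code> are all invented for the example.

```python
def cky_recognize(words, lexical_rules, binary_rules, start="S"):
    """Return True if the grammar derives the word sequence.

    lexical_rules: list of (nonterminal, word) pairs, e.g. ("V", "eats")
    binary_rules:  list of (nonterminal, (left, right)) pairs
    """
    n = len(words)
    if n == 0:
        return False
    # chart[i][j] holds every nonterminal that derives words[i:j].
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        chart[i][i + 1] = {lhs for lhs, w in lexical_rules if w == word}
    # Build longer spans from pairs of shorter, adjacent spans.
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for lhs, (left, right) in binary_rules:
                    if left in chart[i][k] and right in chart[k][j]:
                        chart[i][j].add(lhs)
    return start in chart[0][n]


# Hypothetical toy grammar in Chomsky normal form.
LEXICAL = [("NP", "she"), ("V", "eats"), ("Det", "a"), ("N", "fish")]
BINARY = [("S", ("NP", "VP")), ("VP", ("V", "NP")), ("NP", ("Det", "N"))]

if __name__ == "__main__":
    print(cky_recognize("she eats a fish".split(), LEXICAL, BINARY))
    print(cky_recognize("eats she a fish".split(), LEXICAL, BINARY))
```

The three nested loops over span length, start position and split point give the cubic running time in the sentence length that makes pruning attractive for large natural-language grammars.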
== Programming languages ==
The most common use of a parser is as a component of a [[compiler]] or [[interpreter]]. This parses the [[source code]] of a [[computer programming language]] to create some form of internal representation. Programming languages tend to be specified in terms of a [[context-free grammar]] because fast and efficient parsers can be written for them. Parsers are written by hand or generated by [[parser generator]]s.
Context-free grammars are limited in the extent to which they can express all of the requirements of a language. Informally, the reason is that the memory of such a language is limited. The grammar cannot remember the presence of a construct over an arbitrarily long input; this is necessary for a language in which, for example, a name must be declared before it may be referenced. More powerful grammars that can express this constraint, however, cannot be parsed efficiently. Thus, it is a common strategy to create a relaxed parser for a context-free grammar which accepts a superset of the desired language constructs (that is, it accepts some invalid constructs); later, the unwanted constructs can be filtered out.
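The two-step strategy can be sketched as follows, assuming a relaxed context-free parser has already accepted the program and reduced it to a flat list of statements. The statement encoding (pairs such as <code>("declare", "x")</code>) and the function name <code>undeclared_uses</code> are hypothetical, chosen only to show the later filtering pass.

```python
def undeclared_uses(statements):
    """Second pass: report names that are used before being declared.

    statements: list of (kind, name) pairs, where kind is "declare" or
    "use". The list stands for the output of a relaxed parser that
    accepted the input without checking declarations.
    """
    declared = set()
    errors = []
    for index, (kind, name) in enumerate(statements):
        if kind == "declare":
            declared.add(name)
        elif kind == "use" and name not in declared:
            errors.append((index, name))
    return errors


if __name__ == "__main__":
    program = [("declare", "x"), ("use", "x"), ("use", "y"), ("declare", "y")]
    print(undeclared_uses(program))
```

The set of declared names is exactly the unbounded memory that the context-free grammar lacks; keeping it in a separate pass lets the parser itself stay simple and fast.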
===Overview of process===
[[image:Parser_Flow.gif|right|Flow of data in a typical parser]]
The following example demonstrates the common case of parsing a computer language with two levels of grammar: lexical and syntactic.
The first stage is token generation, or [[lexical analysis]], by which the input character stream is split into meaningful symbols defined by a grammar of [[regular expression]]s. For example, a calculator program would look at an input such as "12*(3+4)^2" and split it into the tokens 12, *, (, 3, +, 4, ), ^, and 2, each of which is a meaningful symbol in the context of an arithmetic expression. The lexer would contain rules to tell it that the characters *, +, ^, ( and ) mark the start of a new token, so meaningless tokens like "12*" or "(3" will not be generated.
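A minimal sketch of this stage in Python, assuming the calculator's tokens are unsigned integers plus a small set of operator and parenthesis characters. The function name <code>tokenize</code> and the exact token set are assumptions made for the illustration.

```python
import re

# One regular expression describes every token: a run of digits,
# or a single operator or parenthesis character.
TOKEN_RE = re.compile(r"\d+|[-+*/^()]")


def tokenize(text):
    """Split an arithmetic expression into a list of token strings."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1  # whitespace separates tokens but is not a token
            continue
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(
                f"unexpected character {text[pos]!r} at position {pos}"
            )
        tokens.append(match.group())
        pos = match.end()
    return tokens


if __name__ == "__main__":
    print(tokenize("12*(3+4)^2"))
```

Because the digit rule matches the longest run of digits and each operator is its own rule, "12*" can never come out as a single token.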
The next stage is parsing or syntactic analysis, which is checking that the tokens form an allowable expression. This is usually done with reference to a [[context-free grammar]] which recursively defines components that can make up an expression and the order in which they must appear. However, not all rules defining programming languages can be expressed by context-free grammars alone, for example type validity and proper declaration of identifiers. These rules can be formally expressed with [[attribute grammar]]s.
The final phase is [[Semantic analysis (computer science)|semantic parsing]] or analysis, which is working out the implications of the expression just validated and taking the appropriate action. In the case of a calculator or interpreter, the action is to evaluate the expression or program; a compiler, on the other hand, would generate some kind of code. Attribute grammars can also be used to define these actions.
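For the calculator case, the three stages can be combined in a short recursive-descent sketch in Python, in which each grammar rule becomes one method and the semantic action (evaluation) is performed as soon as a rule is recognized. The grammar below, the class name <code>Calculator</code> and the choice to make <code>^</code> right-associative are assumptions of this illustration; the simplified tokenizer also silently ignores characters it does not know.

```python
import re


def tokenize(text):
    """Simplified lexer: integers, operators and parentheses only."""
    return re.findall(r"\d+|[-+*/^()]", text)


class Calculator:
    """Grammar (assumed for this sketch):
        expr  := term  (('+' | '-') term)*
        term  := power (('*' | '/') power)*
        power := atom ('^' power)?
        atom  := NUMBER | '(' expr ')'
    """

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def expr(self):
        value = self.term()
        while self.peek() in ("+", "-"):
            if self.advance() == "+":
                value += self.term()
            else:
                value -= self.term()
        return value

    def term(self):
        value = self.power()
        while self.peek() in ("*", "/"):
            if self.advance() == "*":
                value *= self.power()
            else:
                value /= self.power()
        return value

    def power(self):
        base = self.atom()
        if self.peek() == "^":
            self.advance()
            return base ** self.power()  # right-associative
        return base

    def atom(self):
        tok = self.advance()
        if tok == "(":
            value = self.expr()
            if self.advance() != ")":
                raise SyntaxError("expected ')'")
            return value
        if tok is not None and tok.isdigit():
            return int(tok)
        raise SyntaxError(f"unexpected token {tok!r}")


def evaluate(text):
    calc = Calculator(tokenize(text))
    value = calc.expr()
    if calc.peek() is not None:
        raise SyntaxError(f"unexpected token {calc.peek()!r}")
    return value


if __name__ == "__main__":
    print(evaluate("12*(3+4)^2"))
```

A compiler would keep the same parsing methods but return a tree node or emit code in each rule instead of computing a number.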
==Types of parsers==
The task of the parser is essentially to determine if and how the input can be derived from the start symbol of the grammar. This can be done in essentially two ways:
*[[Top-down parsing]] - Top-down parsing can be viewed as an attempt to find left-most derivations of an input stream by searching for [[parse tree|parse-trees]] using a top-down expansion of the given [[formal grammar]] rules. Tokens are consumed from left to right. Inclusive choice is used to accommodate [[ambiguity]] by expanding all alternative right-hand sides of grammar rules. [[LL parser]]s and [[recursive-descent parser]]s are examples of top-down parsers, which in their simple forms cannot accommodate [[left recursion | left recursive]] productions. Although it has been believed that simple implementations of top-down parsing cannot accommodate direct and indirect left-recursion and may require exponential time and space complexity while parsing ambiguous [[context-free grammar]]s, a more sophisticated algorithm for top-down parsing has been created by Frost, Hafiz, and Callaghan, which accommodates [[ambiguity]] and [[left recursion]] in polynomial time and which generates polynomial-size representations of the potentially exponential number of parse trees. Their algorithm is able to produce both left-most and right-most derivations of an input with respect to a given CFG.
*[[Bottom-up parsing]] - A parser can start with the input and attempt to rewrite it to the start symbol. Intuitively, the parser attempts to locate the most basic elements, then the elements containing these, and so on. [[LR parser]]s are examples of bottom-up parsers. Another term used for this type of parser is Shift-Reduce parsing.
Another important distinction is whether the parser generates a ''leftmost derivation'' or a ''rightmost derivation'' (see [[context-free grammar]]). LL parsers will generate a leftmost [[derivation]] and LR parsers will generate a rightmost derivation (although usually in reverse).
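The bottom-up, shift-reduce style and the "rightmost derivation in reverse" can be seen in a toy Python sketch for the two-rule grammar E → E + n and E → n. It is a greedy hand-written loop, not a table-driven LR parser, and it works only because this particular grammar never requires lookahead; the function name <code>shift_reduce</code> and the token spelling <code>"n"</code> are assumptions of the example.

```python
def shift_reduce(tokens):
    """Parse tokens for the grammar  E -> E + n | n.

    Returns (accepted, reductions), where reductions lists the rules
    in the order in which they were applied.
    """
    stack = []
    reductions = []
    remaining = list(tokens)
    while True:
        if stack[-3:] == ["E", "+", "n"]:
            stack[-3:] = ["E"]            # reduce by E -> E + n
            reductions.append("E -> E + n")
        elif stack[-1:] == ["n"]:
            stack[-1:] = ["E"]            # reduce by E -> n
            reductions.append("E -> n")
        elif remaining:
            stack.append(remaining.pop(0))  # shift the next token
        else:
            break
    return stack == ["E"], reductions


if __name__ == "__main__":
    accepted, steps = shift_reduce(["n", "+", "n", "+", "n"])
    print(accepted)
    print(steps)
```

Read backwards, the recorded reductions give the rightmost derivation E ⇒ E + n ⇒ E + n + n ⇒ n + n + n, in which the rightmost nonterminal is expanded at every step.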
== Examples of parsers ==
=== Top-down parsers ===
Some of the parsers that use [[top-down parsing]] include:
* [[Recursive descent parser]]
* [[LL parser]] ('''L'''eft-to-right, '''L'''eftmost derivation)
* [http://www.cs.uwindsor.ca/~hafiz/proHome.html X-SAIGA] - eXecutable SpecificAtIons of GrAmmars. Contains publications related to a top-down parsing algorithm that supports left-recursion and ambiguity in polynomial time and space.
=== Bottom-up parsers ===
Some of the parsers that use [[bottom-up parsing]] include:
* Precedence parser
** [[Operator-precedence parser]]
** [[Simple precedence parser]]
* BC (bounded context) parsing
* [[LR parser]] ('''L'''eft-to-right, '''R'''ightmost derivation)
** [[SLR parser|Simple LR (SLR) parser]]
** [[LALR parser]]
** [[Canonical LR parser|Canonical LR (LR(1)) parser]]
** [[GLR parser]]
* [[CYK algorithm|CYK parser]]
Lexical category
In [[grammar]], a '''lexical category''' (also '''word class''', '''lexical class''', or in traditional grammar '''part of speech''') is a linguistic category of words (or more precisely ''lexical items''), which is generally defined by the [[syntactic]] or [[morphology (linguistics)|morphological]] behaviour of the lexical item in question. Common linguistic categories include ''noun'' and ''verb'', among others. There are [[open class word|open word classes]], which constantly acquire new members, and [[closed class word|closed word classes]], which acquire new members infrequently if at all.
Different languages may have different lexical categories, or they might associate different properties with the same one. For example, [[Japanese language|Japanese]] has at least three classes of adjectives where English has one; Chinese and Japanese have [[measure word]]s while European languages have nothing resembling them; many languages do not distinguish between adjectives and adverbs, or between adjectives and nouns. Many linguists argue that the formal distinctions between parts of speech must be made within the framework of a specific language or language family, and should not be carried over to other languages or language families.
==History==
The classification of words into lexical categories is found from the earliest moments in the [[history of linguistics]]. In the ''[[Nirukta]]'', written in the [[5th century BCE|5th]] or [[6th century BCE]], the [[Sanskrit grammarian]] [[Yāska]] defined four main categories of words:
# nāma - [[noun]]s or substantives
# ākhyāta - [[verb]]s
# upasarga - pre-verbs or [[prefix]]es
# nipāta - [[Grammatical particle|particle]]s, invariant words (perhaps [[prepositions]])
These four were grouped into two large classes: [[inflection|inflected]] (nouns and verbs) and uninflected (pre-verbs and particles).
A century or two later, the [[Classical Greece|Greek]] scholar [[Plato]] wrote in the [[Cratylus (dialogue)|''Cratylus'' dialog]] that "... sentences are, I conceive, a combination of verbs [''rhēma''] and nouns [''ónoma'']". Another class, "conjunctions" (covering [[Grammatical conjunction|conjunction]]s, [[pronoun]]s, and the [[article (grammar)|article]]), was later added by [[Aristotle]].
By the end of the [[2nd century BCE]], the classification scheme had been expanded into eight categories, seen in the ''[[Art of Grammar|Tékhnē grammatiké]]'':
# Noun: a part of speech inflected for case, signifying a concrete or abstract entity
# Verb: a part of speech without case inflection, but inflected for tense, person and number, signifying an activity or process performed or undergone
# Participle: a part of speech sharing the features of the verb and the noun
# Article: a part of speech inflected for case and preposed or postposed to nouns (the relative pronoun is meant by the postposed article)
# Pronoun: a part of speech substitutable for a noun and marked for person
# Preposition: a part of speech placed before other words in composition and in syntax
# Adverb: a part of speech without inflection, in modification of or in addition to a verb
# Conjunction: a part of speech binding together the discourse and filling gaps in its interpretation
The [[Latin grammar]]ian [[Priscian]] ([[floruit|fl.]] [[500 CE]]) modified the above eight-fold system, substituting "[[interjection]]" for "article". It was not until 1767 that the [[adjective]] was taken as a separate class.
Traditional English grammar is patterned after the European tradition above, and is still taught in schools and used in [[dictionaries]]. It names eight parts of speech: [[noun]], [[verb]], [[adjective]], [[adverb]], [[pronoun]], [[preposition]], [[Grammatical conjunction|conjunction]], and [[interjection]] (sometimes called an exclamation).
==Controversies==
Since the Greek grammarians of the 2nd century BCE, parts of speech have been defined by [[morphology (linguistics)|morphological]], [[syntax|syntactic]] and [[semantics|semantic]] criteria. However, there is currently no generally agreed-upon classification scheme that can apply to all languages, or even a set of criteria upon which such a scheme should be based.
Linguists recognize that the above list of eight word classes is simplified and artificial. For example, "adverb" is to some extent a catch-all class that includes words with many different functions. Some have even argued that the most basic of category distinctions, that of nouns and verbs, is unfounded, or not applicable to certain languages.
==Functional classification==
Common ways of delimiting words by function include:
* '''[[Open word classes]]:'''
**[[adjective]]s
**[[adverb]]s
**[[interjection]]s
**[[noun]]s
**[[verb]]s (except [[auxiliary verb]]s)
* '''[[Closed word classes]]:'''
**[[auxiliary verb]]s
**[[clitic]]s
**[[coverb]]s
**[[Grammatical conjunction|conjunction]]s
**[[determiner (class)|Determiner]]s ([[article (grammar)|article]]s, [[quantifier]]s, [[demonstrative adjective]]s, and [[possessive adjective]]s)
**[[grammatical particle|particle]]s
**[[measure word]]s
**[[adposition]]s (prepositions, postpositions, and circumpositions)
**[[preverb]]s
**[[pronoun]]s
**[[Contraction (grammar)|contraction]]s
**[[Names of numbers in English#Cardinal numbers|cardinal numbers]]
==English==
[[English language|English]] frequently does not [[marker (linguistics)|mark]] words as belonging to one part of speech or another. Words like ''neigh'', ''break'', ''outlaw'', ''laser'', ''microwave'' and ''telephone'' might all be either verb forms or nouns. Although ''-ly'' is an adverb marker, not all adverbs end in ''-ly'' and not all words ending in ''-ly'' are adverbs. For instance, ''tomorrow'', ''slow'', ''fast'', ''crosswise'' can all be adverbs, while ''early'', ''friendly'', ''ugly'' are all adjectives (though ''early'' can also function as an adverb).
In certain circumstances, even words with primarily grammatical functions can be used as verbs or nouns, as in "We must look to the ''hows'' and not just the ''whys''" or "Miranda was ''to-ing and fro-ing'' and not paying attention".
Part-of-speech tagging
'''Part-of-speech tagging''' ('''POS tagging''' or '''POST'''), also called '''grammatical tagging''', is the process of marking up each word in a text as corresponding to a particular [[parts of speech|part of speech]], based on both its definition and its context, i.e. its relationship with adjacent and related words in a [[phrase]], [[sentence]], or [[paragraph]].
A simplified form of this is commonly taught to school-age children, in the identification of words as [[noun]]s, [[verb]]s, [[adjective]]s, [[adverb]]s, etc.
Once performed by hand, POS tagging is now done in the context of [[computational linguistics]], using [[algorithms]] which associate discrete terms, as well as hidden parts of speech, in accordance with a set of descriptive tags.
==History==
Research on part-of-speech tagging has been closely tied to [[corpus linguistics]]. The first major corpus of English for computer analysis was the [[Brown Corpus]] developed at [[Brown University]] by [[Henry Kucera]] and [[Nelson Francis]], in the mid-1960s. It consists of about 1,000,000 words of running English prose text, made up of 500 samples from randomly chosen publications. Each sample is 2,000 or more words (ending at the first sentence-end after 2,000 words, so that the corpus contains only complete sentences).
The [[Brown Corpus]] was painstakingly "tagged" with part-of-speech markers over many years. A first approximation was done with a program by Greene and Rubin, which consisted of a huge handmade list of what categories could co-occur at all. For example, article then noun can occur, but article then verb (arguably) cannot. The program got about 70% correct. Its results were repeatedly reviewed and corrected by hand, and later users sent in errata, so that by the late 1970s the tagging was nearly perfect (allowing for some cases on which even human speakers might not agree).
This corpus has been used for innumerable studies of word-frequency and of part-of-speech, and inspired the development of similar "tagged" corpora in many other languages. Statistics derived by analyzing it formed the basis for most later part-of-speech tagging systems, such as CLAWS and [[VOLSUNGA]]. However, by this time (2005) it has been superseded by larger corpora such as the 100 million word [[British National Corpus]].
For some time, part-of-speech tagging was considered an inseparable part of [[natural language processing]], because there are certain cases where the correct part of speech cannot be decided without understanding the [[semantics]] or even the [[pragmatics]] of the context. This is extremely expensive, especially because analyzing the higher levels is much harder when multiple part-of-speech possibilities must be considered for each word.
In the mid-1980s, researchers in Europe began to use [[hidden Markov model]]s (HMMs) to disambiguate parts of speech, when working to tag the [[Lancaster-Oslo-Bergen Corpus]] of British English. HMMs involve counting cases (such as from the Brown Corpus), and making a table of the probabilities of certain sequences. For example, once an article such as 'the' has been seen, perhaps the next word is a noun 40% of the time, an adjective 40%, and a number 20%. Knowing this, a program can decide that "can" in "the can" is far more likely to be a noun than a verb or a modal. The same method can of course be used to benefit from knowledge about following words.
More advanced ("higher order") HMMs learn the probabilities not only of pairs, but of triples or even larger sequences. So, for example, after an article and a verb, the next item may very likely be a preposition, article, or noun, but is much less likely to be another verb.
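The counting step can be sketched as follows. The four hand-tagged sentences and the tag names are invented for illustration and are not drawn from the Brown Corpus or the Lancaster-Oslo-Bergen Corpus; the function name <code>transition_probabilities</code> is likewise an assumption.

```python
from collections import Counter, defaultdict

# Hypothetical hand-tagged sentences: lists of (word, tag) pairs.
TAGGED = [
    [("the", "ART"), ("can", "NOUN"), ("fell", "VERB")],
    [("the", "ART"), ("old", "ADJ"), ("dog", "NOUN"), ("can", "MODAL"), ("run", "VERB")],
    [("the", "ART"), ("two", "NUM"), ("dogs", "NOUN"), ("run", "VERB")],
    [("a", "ART"), ("dog", "NOUN"), ("barked", "VERB")],
]

def transition_probabilities(sentences):
    """Estimate P(next tag | previous tag) by counting adjacent tag pairs."""
    pair_counts = defaultdict(Counter)
    for sentence in sentences:
        tags = [tag for _, tag in sentence]
        for prev, nxt in zip(tags, tags[1:]):
            pair_counts[prev][nxt] += 1
    # Normalise each row of counts into relative frequencies.
    return {
        prev: {nxt: n / sum(counts.values()) for nxt, n in counts.items()}
        for prev, counts in pair_counts.items()
    }

TRANSITIONS = transition_probabilities(TAGGED)
print(TRANSITIONS["ART"])
```

In this toy data an article is never followed directly by a verb or a modal, which is the kind of evidence that lets a tagger prefer the noun reading of "can" in "the can".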
When several ambiguous words occur together, the possibilities multiply. However, it is easy to enumerate every combination and to assign a relative probability to each one, by multiplying together the probabilities of each choice in turn. The combination with highest probability is then chosen. The European group developed CLAWS, a tagging program that did exactly this, and achieved accuracy in the 93-95% range.
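A minimal sketch of this exhaustive approach is given below. It is not the CLAWS implementation: the probability tables are invented for illustration, the word-given-tag ("emission") probabilities are an added assumption of the sketch, and all names are hypothetical.

```python
from itertools import product

# Hypothetical probabilities, invented for illustration only.
# P(tag | previous tag); "<s>" marks the start of the sentence.
TRANSITION = {
    ("<s>", "ART"): 0.8,
    ("ART", "NOUN"): 0.6, ("ART", "VERB"): 0.01, ("ART", "MODAL"): 0.01,
    ("NOUN", "VERB"): 0.5, ("NOUN", "NOUN"): 0.2,
    ("VERB", "NOUN"): 0.3, ("VERB", "VERB"): 0.05,
    ("MODAL", "VERB"): 0.7, ("MODAL", "NOUN"): 0.01,
}
# P(word | tag)
EMISSION = {
    ("the", "ART"): 0.6,
    ("can", "NOUN"): 0.01, ("can", "VERB"): 0.005, ("can", "MODAL"): 0.3,
    ("rusts", "VERB"): 0.002, ("rusts", "NOUN"): 0.0005,
}
# Tags each word may take.
CANDIDATES = {"the": ["ART"], "can": ["NOUN", "VERB", "MODAL"], "rusts": ["NOUN", "VERB"]}

def score(words, tags):
    """Multiply the probabilities of each choice in turn."""
    prob, prev = 1.0, "<s>"
    for word, tag in zip(words, tags):
        prob *= TRANSITION.get((prev, tag), 0.0) * EMISSION.get((word, tag), 0.0)
        prev = tag
    return prob

def best_tagging(words):
    """Enumerate every combination of candidate tags and keep the most probable."""
    options = [CANDIDATES[word] for word in words]
    return max(product(*options), key=lambda tags: score(words, tags))

print(best_tagging(["the", "can", "rusts"]))
```

The number of combinations is the product of the number of candidate tags per word (here 1 × 3 × 2 = 6), which is why the approach becomes expensive for long runs of ambiguous words.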
It is worth remembering, as [[Eugene Charniak]] points out in ''Statistical techniques for natural language parsing'' [http://www.cs.brown.edu/people/ec/home.html], that merely assigning the most common tag to each known word and the tag "proper noun" to all unknowns, will approach 90% accuracy because many words are unambiguous.
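This baseline is simple enough to sketch in a few lines. The training pairs, the tag names and the function names below are hypothetical; the accuracy figure quoted above comes from the source, not from this toy example.

```python
from collections import Counter, defaultdict

def train_baseline(tagged_words):
    """Map each known word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_baseline(words, lexicon, unknown_tag="PROPN"):
    """Tag known words with their most common tag and unknown words as proper nouns."""
    return [(word, lexicon.get(word.lower(), unknown_tag)) for word in words]

# Hypothetical training data, invented for illustration.
TRAINING = [
    ("the", "ART"), ("can", "MODAL"), ("can", "MODAL"), ("can", "NOUN"),
    ("dog", "NOUN"), ("runs", "VERB"),
]
LEXICON = train_baseline(TRAINING)
print(tag_baseline(["The", "dog", "can", "Fido"], LEXICON))
```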
CLAWS pioneered the field of HMM-based part of speech tagging, but was quite expensive since it enumerated all possibilities. It sometimes had to resort to backup methods when there were simply too many (the [[Brown Corpus]] contains a case with 17 ambiguous words in a row, and there are words such as "still" that can represent as many as 7 distinct parts of speech).
In 1987, [[Steve DeRose]] and [[Ken Church]] independently developed [[dynamic programming]] algorithms to solve the same problem in vastly less time. Their methods were similar to the [[Viterbi algorithm]] known for some time in other fields. DeRose used a table of pairs, while Church used a table of triples and an ingenious method of estimating the values for triples that were rare or nonexistent in the Brown Corpus (actual measurement of triple probabilities would require a much larger corpus). Both methods achieved accuracy over 95%. DeRose's 1990 dissertation at [[Brown University]] included analyses of the specific error types, probabilities, and other related data, and replicated his work for Greek, where it proved similarly effective.
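The dynamic programming idea can be sketched with a Viterbi-style search over a table of pairs. This is not DeRose's or Church's actual program: the probability tables are invented, the word-given-tag probabilities are an added assumption, and the function name <code>viterbi</code> is chosen for this example.

```python
# Hypothetical probabilities, invented for illustration only.
TRANSITION = {  # P(tag | previous tag); "<s>" marks the sentence start
    ("<s>", "ART"): 0.8,
    ("ART", "NOUN"): 0.6, ("ART", "VERB"): 0.01, ("ART", "MODAL"): 0.01,
    ("NOUN", "VERB"): 0.5, ("NOUN", "NOUN"): 0.2,
    ("VERB", "NOUN"): 0.3, ("VERB", "VERB"): 0.05,
    ("MODAL", "VERB"): 0.7, ("MODAL", "NOUN"): 0.01,
}
EMISSION = {  # P(word | tag)
    ("the", "ART"): 0.6,
    ("can", "NOUN"): 0.01, ("can", "VERB"): 0.005, ("can", "MODAL"): 0.3,
    ("rusts", "VERB"): 0.002, ("rusts", "NOUN"): 0.0005,
}
CANDIDATES = {"the": ["ART"], "can": ["NOUN", "VERB", "MODAL"], "rusts": ["NOUN", "VERB"]}

def viterbi(words, candidates, transition, emission, start="<s>"):
    """Return (probability, tags) of the most probable tag sequence."""
    # best[tag] = (probability of the best path ending in tag, that path)
    best = {start: (1.0, [])}
    for word in words:
        step = {}
        for tag in candidates[word]:
            emit = emission.get((word, tag), 0.0)
            # Keep only the best predecessor for this tag; weaker paths are dropped.
            prob, path = max(
                ((p * transition.get((prev, tag), 0.0) * emit, prev_path)
                 for prev, (p, prev_path) in best.items()),
                key=lambda item: item[0],
            )
            step[tag] = (prob, path + [tag])
        best = step
    return max(best.values(), key=lambda item: item[0])

probability, tags = viterbi(["the", "can", "rusts"], CANDIDATES, TRANSITION, EMISSION)
print(tags, probability)
```

Because only the best path into each tag is kept at every word, the work grows roughly with the number of words times the square of the number of tags, rather than with the number of complete tag combinations.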
These findings were surprisingly disruptive to the field of [[Natural Language Processing]]. The accuracy reported was higher than the typical accuracy of very sophisticated algorithms that integrated part of speech choice with many higher levels of linguistic analysis: syntax, morphology, semantics, and so on. CLAWS, DeRose's and Church's methods did fail for some of the known cases where semantics is required, but those proved negligibly rare. This convinced many in the field that part-of-speech tagging could usefully be separated out from the other levels of processing; this in turn simplified the theory and practice of computerized language analysis, and encouraged researchers to find ways to separate out other pieces as well. Markov Models are now the standard method for part-of-speech assignment.
The methods already discussed involve working from a pre-existing corpus to learn tag probabilities. It is, however, also possible to [[Bootstrapping (linguistics)|bootstrap]] using "unsupervised" tagging. Unsupervised tagging techniques use an untagged corpus for their training data and produce the tagset by induction. That is, they observe patterns in word use, and derive part-of-speech categories themselves. For example, statistics readily reveal that "the", "a", and "an" occur in similar contexts, while "eat" occurs in very different ones. With sufficient iteration, similarity classes of words emerge that are remarkably similar to those human linguists would expect; and the differences themselves sometimes suggest valuable new insights.
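The observation about similar contexts can be made concrete with a small sketch. The six-sentence corpus is invented, and representing each word by counts of its left and right neighbours, compared by cosine similarity, is one simple choice among many; it is not claimed to be the method of any particular unsupervised tagger.

```python
from collections import Counter
from math import sqrt

# Hypothetical untagged corpus, invented for illustration.
CORPUS = [
    "the dog chased the cat",
    "a dog chased a cat",
    "the cat saw a dog",
    "a cat saw the dog",
    "dogs eat meat",
    "cats eat fish",
]

def context_vectors(sentences):
    """Count, for every word, which words appear immediately to its left and right."""
    vectors = {}
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for i in range(1, len(tokens) - 1):
            vec = vectors.setdefault(tokens[i], Counter())
            vec["L:" + tokens[i - 1]] += 1
            vec["R:" + tokens[i + 1]] += 1
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

VECTORS = context_vectors(CORPUS)
print(cosine(VECTORS["the"], VECTORS["a"]))    # close to 1: very similar contexts
print(cosine(VECTORS["the"], VECTORS["eat"]))  # 0: no shared contexts in this corpus
```

Grouping words whose vectors are close to one another is one way such similarity classes could be induced.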
Both supervised and unsupervised tagging can be further subdivided into rule-based, stochastic, and neural approaches. Some current major algorithms for '''part-of-speech tagging''' include the [[Viterbi algorithm]], [[Brill Tagger]], and the [[Baum-Welch algorithm]] (also known as the forward-backward algorithm). [[Hidden Markov model]] and [[visible Markov model]] taggers can both be implemented using the [[Viterbi algorithm]].
Pattern recognition
'''Pattern recognition''' is a sub-topic of [[machine learning]]. It can be defined as
:"the act of taking in raw data and taking an action based on the [[Category (taxonomy)|category]] of the data".
Most research in pattern recognition is about methods for [[supervised learning]] and [[unsupervised learning]].
Pattern recognition aims to classify [[data]] ([[pattern]]s) based on either ''[[A priori and a posteriori (philosophy)|a priori]]'' knowledge or on [[statistics|statistical]] information extracted from the patterns. The patterns to be classified are usually groups of measurements or observations, defining points in an appropriate [[space (mathematics)|multidimensional space]]. This is in contrast to '''[[pattern matching]]''', where the pattern is rigidly specified.
==Overview==
A complete pattern recognition system consists of a [[sensor]] that gathers the observations to be classified or described; a [[feature extraction]] mechanism that computes numeric or symbolic information from the observations; and a [[statistical classification|classification]] or description scheme that does the actual job of classifying or describing observations, relying on the extracted features.
The classification or description scheme is usually based on the availability of a set of patterns that have already been classified or described. This set of patterns is termed the [[training set]] and the resulting learning strategy is characterized as [[supervised learning]]. Learning can also be [[unsupervised learning|unsupervised]], in the sense that the system is not given an ''a priori'' labeling of patterns; instead it establishes the classes itself based on the statistical regularities of the patterns.
The classification or description scheme usually uses one of two approaches: [[statistical classification|statistical]] (or decision-theoretic) or [[syntactic pattern recognition|syntactic]] (or structural). Statistical pattern recognition is based on statistical characterisations of patterns, assuming that the patterns are generated by a [[probabilistic]] system. Syntactic (or structural) pattern recognition is based on the structural interrelationships of features. A wide range of algorithms can be applied for pattern recognition, from very simple [[Naive Bayes classifier|Bayesian classifiers]] to much more powerful [[Artificial neural network|neural networks]].
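As an illustration of the statistical approach with a training set, the following is a minimal sketch of a naive Bayes text classifier. The four labelled messages, the label names and the use of add-one smoothing are assumptions made for this example.

```python
from collections import Counter
from math import log

def train(examples):
    """Count labels and per-label word frequencies from (tokens, label) pairs."""
    label_counts, word_counts, vocab = Counter(), {}, set()
    for tokens, label in examples:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab

def classify(tokens, model):
    """Pick the label with the highest log-probability under the naive Bayes model."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    scores = {}
    for label, n in label_counts.items():
        denom = sum(word_counts[label].values()) + len(vocab)  # add-one smoothing
        score = log(n / total)                                 # prior P(label)
        for token in tokens:
            if token in vocab:                                 # ignore unseen words
                score += log((word_counts[label][token] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical training set, invented for illustration.
TRAINING = [
    ("win money now".split(), "spam"),
    ("cheap money offer".split(), "spam"),
    ("meeting agenda attached".split(), "ham"),
    ("lunch meeting tomorrow".split(), "ham"),
]
MODEL = train(TRAINING)
print(classify("cheap money".split(), MODEL))
print(classify("meeting tomorrow".split(), MODEL))
```

Here the features are simply the words of a message; in other applications the feature extraction stage would supply measurements of a different kind.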
An intriguing problem in pattern recognition yet to be solved is the relationship between the problem to be solved (data to be classified) and the performance of various pattern recognition algorithms (classifiers).
Pattern recognition is more complex when templates are used to generate variants. For example, in English, sentences often follow the "N-VP" (noun - verb phrase) pattern, but some knowledge of the English language is required to detect the pattern. Pattern recognition is studied in many fields, including [[psychology]], [[ethology]], and [[computer science]]. [[Holographic associative memory]] is another type of pattern matching scheme, in which a small target pattern can be searched for in a large set of learned patterns based on cognitive meta-weight.
==Uses==
Within medical science, pattern recognition forms the basis for [[computer-aided diagnosis]] (CAD) systems. CAD describes a procedure that supports the doctor's interpretations and findings.
Typical applications are automatic [[speech recognition]], [[document classification|classification of text into several categories]] (e.g. spam/non-spam email messages), the [[handwriting recognition|automatic recognition of handwritten postal codes]] on postal envelopes, or the [[facial recognition system|automatic recognition of images]] of human faces. The last two examples form the subtopic [[image analysis]] of pattern recognition that deals with digital images as input to pattern recognition systems.
Phrase
In [[grammar]], a '''phrase''' is a group of [[word]]s that functions as a single unit in the [[syntax]] of a [[Sentence (linguistics)|sentence]].
For example ''the house at the end of the street'' (example 1) is a phrase. It acts like a noun. It contains the phrase ''at the end of the street'' (example 2), a prepositional phrase which acts like an adjective. Example 2 could be replaced by ''white'', to make the phrase ''the white house''. Examples 1 and 2 contain the phrase ''the end of the street'' (example 3) which acts like a noun. It could be replaced by ''the cross-roads'' to give ''the house at the cross-roads''.
Most phrases have a central word which defines the type of phrase. This word is called the [[head (linguistics)|head]] of the phrase. In English the head is often the first word of the phrase. Some phrases, however, can be headless. For example, ''the rich'' is a noun phrase composed of a determiner and an adjective, but no noun.
Phrases may be classified by the type of head they take:
*[[Prepositional phrase]] (PP) with a [[preposition]] as head (e.g. ''in love'', ''over the rainbow''). Languages that use [[postposition]]s instead have [[postpositional phrase]]s. The two types are sometimes collectively referred to as [[adpositional phrase]]s.
*[[Noun phrase]] (NP) with a [[noun]] as head (e.g. ''the black cat'', ''a cat on the mat'')
*[[Verb phrase]] (VP) with a [[verb]] as head (e.g. ''eat cheese'', ''jump up and down'')
*[[Adjectival phrase]] with an [[adjective]] as head (e.g. ''full of toys'')
*[[Adverbial phrase]] with an [[adverb]] as head (e.g. ''very carefully'')
== Formal definition ==
A '''phrase''' is a [[syntax|syntactic]] structure which has syntactic properties derived from its [[head (linguistics)|head]].
== Complexity ==
A complex phrase consists of several words, whereas a simple phrase consists of only one word. This terminology is especially often used with [[verb]] phrases:
* simple past and present are simple verb phrases, which require just one verb
* complex verb phrases have one or two [[grammatical aspect|aspect]]s added, and hence require an additional two or three words
"Complex", which is phrase-level, is often confused with "[[compound (linguistics)|compound]]", which is [[word]]-level. However, there are certain phenomena that formally seem to be phrases but semantically are more like compounds, like "women's magazines", which has the form of a possessive noun phrase, but which refers (just like a compound) to one specific [[lexeme]] (i.e. a magazine for women and not some magazine owned by a woman).
== Semiotic approaches to the concept of "phrase" ==
In more [[semiotic]] approaches to language, such as the more cognitivist versions of [[construction grammar]], a phrasal structure is not only a certain formal combination of word types whose features are inherited from the head. Here each phrasal structure also expresses some type of [[concept]]ual content, be it specific or abstract.
Portuguese language
'''Portuguese''' (''língua portuguesa'') is a [[Romance language]] that originated in what is now [[Galicia (Spain)]] and [[Portugal|northern Portugal]] from the [[Latin language|Latin]] spoken by [[Romanization (cultural)|romanized]] [[Pre-Roman peoples of the Iberian Peninsula]] (namely the [[Gallaeci]], the [[Lusitanians]], the [[Celtici]] and the [[Conii]]) about 2000 years ago. It spread worldwide in the 15th and 16th centuries as Portugal established a [[Portuguese Empire|colonial and commercial empire]] (1415–1999) which spanned from [[Brazil]] in the [[Americas]] to [[Goa]] in [[India]] and [[Macau]] in [[China]]. On the island of [[Sri Lanka]] it was used exclusively as the [[lingua franca]] for almost 350 years. During that time, many [[Portuguese Creole|creole languages based on Portuguese]] also appeared around the world, especially in [[Africa]], [[Asia]], and the [[Caribbean]].
Today it is one of the world's major languages, [[List of languages by number of native speakers|ranked 6th]] according to number of native speakers (approximately 177 million). It is the language with the largest number of speakers in [[South America]], spoken by nearly all of Brazil's population, which amounts to over 51% of the continent's population even though it is the only Portuguese-speaking nation in [[the Americas]]. It is also a major lingua franca in Portugal's former colonial possessions in Africa. It is the official language of ten countries (see the table on the right), also being co-official with [[Spanish language|Spanish]] and [[French language|French]] in [[Equatorial Guinea]], with [[Standard Cantonese|Cantonese]] [[Chinese language|Chinese]] in the Chinese special administrative region of [[Macau]], and with [[Tetum]] in [[East Timor]]. There are sizable communities of Portuguese-speakers in various regions of North America, notably in the [[United States]] ([[New Jersey]], [[New England]] and south [[Florida]]) and in [[Ontario]], [[Canada]]. [[Spain|Spanish]] author [[Miguel de Cervantes]] once called Portuguese "the sweet language", while Brazilian writer [[Olavo Bilac]] poetically described it as ''a última flor do Lácio, inculta e bela'': "the last flower of [[Latium]], wild and beautiful".
==Geographic distribution==
Today, Portuguese is the [[official language]] of [[Angola]], [[Brazil]], [[Cape Verde]], [[Guinea-Bissau]], [[Portugal]], [[São Tomé and Príncipe]] and [[Mozambique]]. It is also one of the official languages of [[Equatorial Guinea]] (with [[Spanish language|Spanish]] and [[French language|French]]), the [[Special Administrative Region of the People's Republic of China|Chinese special administrative region]] of [[Macau]] (with [[Chinese language|Chinese]]), and [[East Timor]] (with [[Tetum]]). It is a [[First language|native language]] of most of the population in Portugal (100%), Brazil (99%), Angola (60%), and São Tomé and Príncipe (50%), and it is spoken by a [[plurality]] of the population of Mozambique (40%), though only 6.5% are native speakers. No data is available for Cape Verde, but almost all the population is bilingual, and the monolingual population speaks [[Cape Verdean Creole]].
Small Portuguese-speaking communities subsist in former overseas colonies of Portugal such as Macau, where it is spoken as a first language by 0.6% of the population, and East Timor.
[[Uruguay]] gave Portuguese an equal status to Spanish in its educational system along its northern border with Brazil. In the rest of the country, it is taught as a compulsory subject beginning in the 6th grade.
It is also spoken by substantial immigrant communities, though not official, in [[Andorra]], [[France]], [[Luxembourg]], [[Jersey]] (with a statistically significant Portuguese-speaking community of approximately 10,000 people), [[Paraguay]], [[Namibia]], [[South Africa]], [[Switzerland]], [[Venezuela]] and in the [[U.S.]] states of [[California]], [[Connecticut]], [[Florida]], [[Massachusetts]], [[New Jersey]], [[New York]] and [[Rhode Island]].
In some parts of India, such as [[Goa]] and [[Daman and Diu]], Portuguese is still spoken. There are also significant populations of Portuguese speakers in [[Canada]] (mainly concentrated in and around [[Toronto]]), [[Bermuda]] and [[Netherlands Antilles]].
Portuguese is an official language of several international organizations. The [[Community of Portuguese Language Countries]] (with the Portuguese acronym CPLP) consists of the eight independent countries that have Portuguese as an official language. It is also an official language of the [[European Union]], [[Mercosul]], the [[Organization of American States]], the [[Organization of Ibero-American States]], the [[Union of South American Nations]], and the [[African Union]] (one of the working languages) and one of the official languages of other organizations. The Portuguese language is gaining popularity in Africa, Asia, and South America as a second language for study.
Portuguese and Spanish are the fastest-growing European languages, and, according to estimates by UNESCO, Portuguese is the language with the highest potential for growth as an international language in southern Africa and South America. The Portuguese-speaking African countries are expected to have a combined population of 83 million by 2050. Since 1991, when Brazil signed into the economic market of Mercosul with other South American nations, such as Argentina, Uruguay, and Paraguay, there has been an increase in interest in the study of Portuguese in those South American countries. The demographic weight of Brazil in the continent will continue to strengthen the presence of the language in the region. Although in the early 21st century, after Macau was ceded to China in 1999, the use of Portuguese was in decline in Asia, it is becoming a language of opportunity there; mostly because of East Timor's boost in the number of speakers in the last five years but also because of increased Chinese diplomatic and financial ties with Portuguese-speaking countries.
In July 2007, President Teodoro Obiang Nguema announced his government's decision to make Portuguese [[Equatorial Guinea]]'s third official language, in order to meet the requirements to apply for full membership of the [[Community of Portuguese Language Countries]]. This upgrading from its current Associate Observer condition would result in Equatorial Guinea being able to access several professional and academic exchange programs and the facilitation of cross-border circulation of citizens. Its application is currently being assessed by other CPLP members.
In March 1994 the [[Bosque de Portugal]] (Portugal's Woods) was founded in the Brazilian city of [[Curitiba]]. The park houses the Portuguese Language Memorial, which honors the Portuguese immigrants and the countries that adopted the Portuguese language. Originally there were seven nations represented with pillars, but the independence of [[East Timor]] brought yet another pillar for that nation in 2007.
In March 2006, the [[Museum of the Portuguese Language]], an interactive museum about the Portuguese language, was founded in [[São Paulo]], Brazil, the city with the largest number of Portuguese speakers in the world.
==Dialects==
Portuguese is a [[pluricentric language]] with two main groups of [[dialect]]s, those of [[Brazil]] and those of the [[Old World]]. For historical reasons, the dialects of Africa and Asia are generally closer to those of Portugal than the Brazilian dialects, although in some aspects of their phonetics, especially the pronunciation of unstressed vowels, they resemble [[Brazilian Portuguese]] more than [[European Portuguese]]. They have not been studied as widely as European and Brazilian Portuguese.
Audio samples of some dialects of Portuguese are available below. There are some differences between the areas but these are the best approximations possible. For example, the ''caipira'' dialect has some differences from the one of Minas Gerais, but in general it is very close. A good example of Brazilian Portuguese may be found in the capital city, [[Brasília]], because its population is drawn from all parts of the country.
'''[[Angola]]'''
# ''Benguelense'' — [[Benguela]] province.
# [http://www.instituto-camoes.pt/cvc/hlp/geografia/som85.html ''Luandense''] — [[Luanda]] province.
# ''Sulista'' — South of Angola.
'''[[Brazil]]'''
# ''[[Caipira]]'' — States of [[São Paulo (state)|São Paulo]] (countryside; the city of São Paulo and the eastern areas of the state have their own dialect, called ''paulistano''); southern [[Minas Gerais]], northern [[Paraná (state)|Paraná]], [[Goiás]] and [[Mato Grosso do Sul]].
# ''Cearense'' — [[Ceará]].
# ''Baiano'' — [[Bahia]].
# [http://www.instituto-camoes.pt/cvc/hlp/geografia/som90.html ''Fluminense''] — Variants spoken in the states of [[Rio de Janeiro (state)|Rio de Janeiro]] and [[Espírito Santo]] (excluding the city of Rio de Janeiro and its adjacent metropolitan areas, which have their own dialect, called ''[[carioca]]'').
# ''[[Gaucho|Gaúcho]]'' — [[Rio Grande do Sul]]. (There are many distinct accents in Rio Grande do Sul, mainly due to the heavy influx of European immigrants of diverse origins, who settled several colonies throughout the state.)
# ''[[Mineiro]]'' — [[Minas Gerais]] (not prevalent in the [[Triângulo Mineiro]], southern and southeastern [[Minas Gerais]]).
# [http://www.instituto-camoes.pt/cvc/hlp/geografia/som91.html ''Nordestino''] — [[Northeast Region, Brazil|northeastern states of Brazil]] ([[Pernambuco]] and [[Rio Grande do Norte]] have a particular way of speaking).
# ''Nortista'' — [[Amazon Basin]] states.
# ''Paulistano'' — Variants spoken around [[São Paulo]] city and the eastern areas of São Paulo state.
# ''Sertanejo'' — States of [[Goiás]] and [[Mato Grosso]] (the city of [[Cuiabá]] has a particular way of speaking).
# ''Sulista'' — Variants spoken in the areas between the northern regions of [[Rio Grande do Sul]] and southern regions of São Paulo state. (The cities of [[Curitiba]], [[Florianópolis]], and [[Itapetininga]] have fairly distinct accents as well.)
'''[[Portugal]]'''
# [http://www.instituto-camoes.pt/cvc/hlp/geografia/som69.html ''Açoriano''] (Azorean) — [[Azores]].
# [http://www.instituto-camoes.pt/cvc/hlp/geografia/som40.html ''Alentejano''] — [[Alentejo]]
# [http://www.instituto-camoes.pt/cvc/hlp/geografia/som44.html ''Algarvio''] — [[Algarve]] (there is a particular dialect in a small part of western Algarve).
# [http://www.instituto-camoes.pt/cvc/hlp/geografia/som1.html ''Alto-Minhoto''] — North of [[Braga]] (hinterland).
# [http://www.instituto-camoes.pt/cvc/hlp/geografia/som49.html ''Baixo-Beirão''; ''Alto-Alentejano''] — Central Portugal (hinterland).
# [http://www.instituto-camoes.pt/cvc/hlp/geografia/som9.html ''Beirão''] — Central Portugal.
# [http://www.instituto-camoes.pt/cvc/hlp/geografia/som22.html ''Estremenho''] — Regions of [[Coimbra]] and [[Lisbon]] (the Lisbon dialect has some peculiar features not shared with the one of Coimbra).
# [http://www.instituto-camoes.pt/cvc/hlp/geografia/som60.html ''Madeirense''] (Madeiran) — [[Madeira]].
# [http://www.instituto-camoes.pt/cvc/hlp/geografia/som14.html ''Nortenho''] — Regions of Braga and [[Porto]].
# [http://www.instituto-camoes.pt/cvc/hlp/geografia/som6.html ''Transmontano''] — [[Trás-os-Montes e Alto Douro]].
Other countries
* '''[[Cape Verde]]''' — [http://www.instituto-camoes.pt/cvc/hlp/geografia/som87.html ''Português cabo-verdiano''] ([[Cape Verdean Portuguese]])
* '''[[Daman and Diu]]''', India — ''Damaense''.
* '''[[East Timor]]''' — [http://www.instituto-camoes.pt/cvc/hlp/geografia/som84.html ''Timorense''] ([[East Timorese Portuguese|East Timorese]])
* '''[[Goa]]''', India — ''Goês''.
* '''[[Guinea-Bissau]]''' — [http://www.instituto-camoes.pt/cvc/hlp/geografia/som88.html ''Guineense''] ([[Guinean Portuguese]]).
* '''[[Macau]]''', China — [http://www.instituto-camoes.pt/cvc/hlp/geografia/som92.html ''Macaense''] ([[Macanese Portuguese|Macanese]])
* '''[[Mozambique]]''' — [http://www.instituto-camoes.pt/cvc/hlp/geografia/som89.html ''Moçambicano''] ([[Mozambican Portuguese|Mozambican]])
* '''[[São Tomé and Príncipe]]''' — [http://www.instituto-camoes.pt/cvc/hlp/geografia/som83.html ''Santomense'']
* '''[[Uruguay]]''' — [[Riverense Portuñol language|''Dialectos Portugueses del Uruguay (DPU)'']].
Differences between dialects are mostly of [[accent (linguistics)|accent]] and [[vocabulary]], but between the Brazilian dialects and other dialects, especially in their most colloquial forms, there can also be some grammatical differences. The [[Portuguese creole|Portuguese-based creole]]s spoken in various parts of Africa, Asia, and the Americas are independent languages which should not be confused with Portuguese itself.
==History==
Arriving in the Iberian Peninsula in 216 BC, the Romans brought with them the [[Latin language]], from which all Romance languages descend. The language was spread by arriving Roman soldiers, settlers and merchants, who built Roman cities mostly near the settlements of previous civilizations.
Between AD 409 and 711, as the Roman Empire collapsed in Western Europe, the Iberian Peninsula was conquered by Germanic peoples ([[Migration Period]]). The occupiers, mainly [[Suebi]] and [[Visigoths]], quickly adopted late Roman culture and the [[Vulgar Latin]] dialects of the peninsula. After the [[Moors|Moorish]] invasion of 711, [[Arabic language|Arabic]] became the administrative language in the conquered regions, but most of the population continued to speak a form of [[Romance languages|Romance]] commonly known as [[Mozarabic]]. The influence exerted by Arabic on the Romance dialects spoken in the Christian kingdoms of the north was small, affecting mainly their lexicon.
The earliest surviving records of a distinctively Portuguese language are administrative documents of the 9th century, still interspersed with many Latin phrases. Today this phase is known as Proto-Portuguese (between the 9th and the 12th centuries). In the first period of Old Portuguese, the [[Galician-Portuguese]] period (from the 12th to the 14th century), the language gradually came into general use. For some time, it was the language of preference for [[lyric poetry]] in Christian Hispania, much as [[Occitan]] was the language of the [[Occitan literature#Poetry_of_the_troubadours|poetry of the troubadours]]. Portugal was formally recognized as an independent kingdom by the [[Kingdom of Leon]] in 1143, with [[Afonso I of Portugal|Afonso Henriques]] as king. In 1290, King [[Denis of Portugal|Dinis]] created the first Portuguese university in Lisbon (the ''Estudos Gerais'', later moved to [[Coimbra]]) and decreed that Portuguese, then simply called the "common language", should be known as the Portuguese language and used officially.
In the second period of Old Portuguese, from the 14th to the 16th century, with the [[Age of discovery|Portuguese discoveries]], the language was taken to many regions of [[Asia]], [[Africa]] and the [[Americas]] (nowadays, the great majority of Portuguese speakers live in Brazil, in South America). By the 16th century it had become a ''[[lingua franca]]'' in Asia and Africa, used not only for colonial administration and trade but also for communication between local officials and Europeans of all nationalities. Its spread was helped by mixed marriages between Portuguese and local people, and by its association with [[Roman Catholic]] [[missionary]] efforts, which led to the formation of a [[creole language]] called [[Kristang language|Kristang]] in many parts of Asia (from the word ''cristão'', "Christian"). The language continued to be popular in parts of Asia until the 19th century. Some Portuguese-speaking Christian communities in [[India]], [[Sri Lanka]], [[Malaysia]], and [[Indonesia]] preserved their language even after they were isolated from Portugal.
The end of the Old Portuguese period was marked by the publication of the ''Cancioneiro Geral'' by [[Garcia de Resende]], in 1516. The early period of Modern Portuguese, which spans from the 16th century to the present day, was characterized by an increase in the number of learned words borrowed from Classical Latin and Classical Greek since the Renaissance, which greatly enriched the lexicon.
===Characterization===
A distinctive feature of Portuguese is that it preserved the stressed vowels of [[Vulgar Latin]], which became diphthongs in other Romance languages; cf. Fr. ''pierre'', Sp. ''piedra'', It. ''pietra'', Port. ''pedra'', from Lat. ''petra''; or Sp. ''fuego'', It. ''fuoco'', Port. ''fogo'', from Lat. ''focum''. Another characteristic of early Portuguese was the loss of [[:wiktionary:intervocalic|intervocalic]] ''l'' and ''n'', sometimes followed by the merger of the two surrounding vowels, or by the insertion of an [[epenthesis|epenthetic vowel]] between them: cf. Lat. ''salire'', ''tenere'', ''catena'', Sp. ''salir'', ''tener'', ''cadena'', Port. ''sair'', ''ter'', ''cadeia''.
When the [[elision|elided]] consonant was ''n'', it often [[nasalization|nasalized]] the preceding vowel: cf. Lat. ''manum'', ''rana'', ''bonum'', Port. ''mão'', ''rãa'', ''bõo'' (now ''mão'', ''rã'', ''bom''). This process was the source of most of the nasal diphthongs which are typical of Portuguese. In particular, the Latin endings ''-anem'', ''-anum'' and ''-onem'' became ''-ão'' in most cases, cf. Lat. ''canem'', ''germanum'', ''rationem'' with Modern Port. ''cão'', ''irmão'', ''razão'', and their plurals ''-anes'', ''-anos'', ''-ones'' normally became ''-ães'', ''-ãos'', ''-ões'', cf. ''cães'', ''irmãos'', ''razões''.
===Movement to make Portuguese an official language of the UN===
There is a growing number of people in the Portuguese-speaking media and on the internet who are urging the CPLP and other organizations to open a debate in the [[Lusophone]] community, with the aim of bringing forward a petition to make Portuguese an official language of the United Nations.
In October 2005, during the international convention of the [http://www.elosinternacional.com.br/index.htm Elos Club International] that took place in Tavira, Portugal, a petition was written and unanimously approved; its text can be found on the internet under the title ''Petição Para Tornar Oficial o Idioma Português na ONU''.
Romulo Alexandre Soares, president of the Brazil-Portugal Chamber, argues that the positioning of Brazil in the international arena as one of the emergent powers of the 21st century, the size of its population, and the presence of the language around the world provide legitimacy and justify a petition to make Portuguese an official language of the UN.
==Vocabulary==
Most of the lexicon of Portuguese is derived from Latin. Nevertheless, because of the [[Moors|Moorish]] occupation of the [[Iberian Peninsula]] during the Middle Ages, and the participation of Portugal in the [[Age of Discovery]], it has adopted loanwords from all over the world.
Very few Portuguese words can be traced to the [[Pre-Roman peoples of the Iberian Peninsula|pre-Roman inhabitants of Portugal]], which included the [[Gallaeci]], [[Lusitanians]], [[Celtici]] and [[Cynetes]]. The [[Phoenicians]] and [[Carthaginians]], briefly present, also left some scarce traces. Some notable examples are ''abóbora'' "pumpkin" and ''bezerro'' "year-old calf", from the nearby [[Celtiberian language]] (probably through the Celtici); ''cerveja'' "beer", from [[Celtic languages|Celtic]]; ''saco'' "bag", from [[Phoenician language|Phoenician]]; and ''cachorro'' "dog, puppy", from [[Basque language|Basque]].
In the 5th century, the Iberian Peninsula (the [[Ancient Rome|Roman]] [[Hispania]]) was conquered by the [[Germanic peoples|Germanic]] [[Suevi]] and [[Visigoths]]. As they adopted the Roman civilization and language, however, these people contributed only a few words to the lexicon, mostly related to warfare — such as ''espora'' "spur", ''estaca'' "stake", and ''guerra'' "war", from [[Gothic language|Gothic]] ''*spaúra'', ''*stakka'', and ''*wirro'', respectively.
Between the 9th and 15th centuries Portuguese acquired about 1000 words from [[Arabic language|Arabic]] by influence of [[al-Andalus|Moorish Iberia]]. They are often recognizable by the initial Arabic article ''a''(''l'')''-'', and include many common words such as ''aldeia'' "village" from الضيعة ''aldaya'', ''alface'' "lettuce" from الخس ''alkhass'', ''armazém'' "warehouse" from المخزن ''almahazan'', and ''azeite'' "olive oil" from زيت ''azzait''. From Arabic came also the grammatically peculiar word [[Insha'Allah|''oxalá'']] "hopefully". The Mozambican currency name [[Mozambican Metical|''metical'']] was derived from the word مطقال ''miṭqāl'', a unit of weight. The word Mozambique itself is from the Arabic name of sultan Muça Alebique (Musa Alibiki). The name of the Portuguese town of [[Fátima, Portugal|Fátima]] comes from the name of one of the daughters of the prophet [[Muhammad]].
Starting in the 15th century, the Portuguese maritime explorations led to the introduction of many loanwords from [[Asia]]n languages. For instance, ''catana'' "cutlass" from Japanese ''katana''; ''corja'' "rabble" from Malay ''kórchchu''; and ''chá'' "tea" from [[Chinese language|Chinese]] [[Tea#The word tea|''chá'']].
From South America came ''batata'' "[[potato]]", from [[Taino]]; ''ananás'' and ''abacaxi'', from [[Tupi-Guarani]] ''naná'' and [[Tupi language|Tupi]] ''ibá cati'', respectively (two species of [[pineapple]]), and ''tucano'' "[[toucan]]" from [[Guarani language|Guarani]] ''tucan''. See [[List of Brazil state name etymologies]], for some more examples.
From the 16th to the 19th century, because of the role of Portugal as intermediary in the [[Atlantic slave trade]] and the establishment of large Portuguese colonies in Angola, Mozambique, and Brazil, Portuguese acquired several words of African and [[indigenous peoples of Brazil|Amerind]] origin, especially names for most of the animals and plants found in those territories. While those terms are mostly used in the former colonies, many became current in European Portuguese as well. From [[Kimbundu language|Kimbundu]], for example, came ''kifumate'' → ''cafuné'' "head caress", ''kusula'' → ''caçula'' "youngest child", ''marimbondo'' "tropical wasp", and ''kubungula'' → ''bungular'' "to dance like a wizard".
Finally, it has received a steady influx of loanwords from other European languages. For example, ''melena'' "hair lock", ''fiambre'' "wet-cured ham" (in contrast with ''presunto'' "dry-cured ham" from Latin ''prae-exsuctus'' "dehydrated"), and ''castelhano'' "Castilian", from Spanish; ''colchete''/''crochê'' "bracket"/"crochet", ''paletó'' "jacket", ''batom'' "lipstick", and ''filé''/''filete'' "steak"/"slice" respectively, from French ''crochet'', ''paletot'', ''bâton'', ''filet''; ''macarrão'' "pasta", ''piloto'' "pilot", ''carroça'' "carriage", and ''barraca'' "barrack", from Italian ''maccherone'', ''pilota'', ''carrozza'', ''baracca''; and ''bife'' "steak", ''futebol'', ''revólver'', ''estoque'', ''folclore'', from English ''beef'', ''football'', ''revolver'', ''stock'', ''folklore''.
==Classification and related languages==
Portuguese belongs to the [[West Iberian languages|West Iberian]] branch of the [[Romance language]]s, and it has special ties with the following members of this group:
* [[Galician language|Galician]] and the [[Fala language|Fala]], its closest relatives. See below.
* [[Spanish language|Spanish]], the major language closest to Portuguese. (See also [[Differences between Spanish and Portuguese]].)
* [[Mirandese language|Mirandese]], another West Iberian language spoken in Portugal.
* [[Judeo-Portuguese]] and [[Ladino language|Judeo-Spanish]], languages spoken by [[Sephardic Jew]]s, which remained close to Portuguese and Spanish.
Despite the obvious lexical and grammatical similarities between Portuguese and other Romance languages, it is not [[mutually intelligible]] with most of them. Apart from Galician, Portuguese speakers will usually need some formal study of basic grammar and vocabulary, before attaining a reasonable level of comprehension of those languages, and vice-versa.
===Galician and the Fala===
The closest language to Portuguese is Galician, spoken in the autonomous community of Galicia (northwestern Spain). The two were at one time a single language, known today as [[Galician-Portuguese]], but since the political separation of Portugal from Galicia they have diverged somewhat, especially in pronunciation and vocabulary. Nevertheless, the core vocabulary and grammar of Galician are still noticeably closer to Portuguese than to Spanish. In particular, like Portuguese, it uses the future subjunctive, the personal infinitive, and the synthetic pluperfect (see the section on the grammar of Portuguese, below). Mutual intelligibility (estimated at 85% by R. A. Hall, Jr., 1989) is good between Galicians and northern Portuguese, but poorer between Galicians and speakers from central Portugal.
The Fala language is another descendant of Galician-Portuguese, spoken by a small number of people in the Spanish towns of Valverdi du Fresnu, As Ellas and Sa Martín de Trebellu (autonomous community of [[Extremadura]], near the border with Portugal).
===Influence on other languages===
Many languages have [[loanword|borrowed words]] from Portuguese, such as [[Bahasa Indonesia|Indonesian]], [[Sri Lanka]]n [[Sri Lanka Tamils (native)|Tamil]] and [[Sinhalese language|Sinhalese]] (see [[Sri Lanka Indo-Portuguese language|Sri Lanka Indo-Portuguese]]), [[Malay language|Malay]], [[Bengali language|Bengali]], [[English (language)|English]], [[Hindi]], [[Konkani language|Konkani]], [[Marathi language|Marathi]], [[Tetum language|Tetum]], [[Tsonga language|Xitsonga]], [[Papiamentu]], [[Japanese language|Japanese]], [[Barbadian|Bajan Creole]] (spoken in Barbados), [[Lanc-Patuá]] (spoken in northern Brazil) and [[Sranan Tongo]] (spoken in Suriname). It left a strong influence on the ''[[Old Tupi|língua brasílica]]'', a [[Tupi-Guarani|Tupi-Guarani language]] which was the most widely spoken in [[Brazil]] until the 18th century, and on the language spoken around [[Sikka]] in [[Flores|Flores Island]], [[Indonesia]]. In nearby [[Larantuka]], Portuguese is used for prayers in [[Holy Week]] rituals.
The Japanese-Portuguese dictionary ''[[Nippo Jisho]]'' (1603) was the first dictionary of Japanese in a European language, a product of [[Society of Jesus|Jesuit]] missionary activity in [[Japan]]. Building on the work of earlier Portuguese missionaries, the ''Dictionarium Anamiticum, Lusitanum et Latinum'' (Annamite-Portuguese-Latin dictionary) of [[Alexandre de Rhodes]] (1651) introduced the modern [[Vietnamese alphabet|orthography of Vietnamese]], which is based on the orthography of 17th-century Portuguese. The [[Romanization]] of [[Chinese language|Chinese]] was also influenced by the Portuguese language (among others), particularly regarding [[List of common Chinese surnames|Chinese surnames]]; one example is ''Mei''.
See also [[List of English words of Portuguese origin]], [[Loan words in Indonesian]], [[Japanese words of Portuguese origin]], [[Malay_language#Borrowed_words|Borrowed words in Malay]], [[Sinhala words of Portuguese origin]], [[Loan words in Sri Lankan Tamil#Portuguese|Loan words from Portuguese in Sri Lankan Tamil]].
===Derived languages===
Beginning in the 16th century, the extensive contacts between Portuguese travelers and settlers, African slaves, and local populations led to the appearance of many [[pidgin]]s with varying amounts of Portuguese influence. As these pidgins became the mother tongue of succeeding generations, they evolved into fully fledged [[creole language]]s, which remained in use in many parts of Asia and Africa until the 18th century. Some Portuguese-based or Portuguese-influenced creoles are still spoken today, by over 3 million people worldwide, especially people of partial [[Portuguese people|Portuguese]] ancestry.
== Phonology ==
There is a maximum of 9 oral vowels and 19 consonants, though some varieties of the language have fewer phonemes (Brazilian Portuguese has only 8 oral vowel [[phone]]s). There are also five nasal vowels, which some linguists regard as allophones of the oral vowels, ten oral [[diphthong]]s, and five nasal diphthongs.
===Vowels===
To the seven vowels of [[Vulgar Latin]], European Portuguese has added two [[Mid-centralized vowel|near central vowels]], one of which tends to be [[elision|elided]] in [[relaxed pronunciation|rapid speech]], like the ''e caduc'' of [[French language|French]] (represented either as {{IPA|/ɯ̽/}}, or {{IPA|/ɨ/}}, or {{IPA|/ə/}}). The close-mid vowels {{IPA|/e o/}} and the open-mid vowels {{IPA|/ɛ ɔ/}} are four distinct phonemes, and they alternate in various forms of [[apophony]]. Like [[Catalan language|Catalan]], Portuguese uses vowel quality to contrast stressed syllables with unstressed syllables: isolated vowels tend to be [[Vowel#Height|raised]], and in some cases centralized, when unstressed. Nasal diphthongs occur mostly at the end of words.
===Consonants===
The consonant inventory of Portuguese is fairly conservative. The medieval affricates {{IPA|/ts/}}, {{IPA|/dz/}}, {{IPA|/tʃ/}}, {{IPA|/dʒ/}} merged with the fricatives {{IPA|/s/}}, {{IPA|/z/}}, {{IPA|/ʃ/}}, {{IPA|/ʒ/}}, respectively, but not with each other, and there have been no other significant changes to the consonant phonemes since then. However, some remarkable dialectal variants and [[allophone]]s have appeared, among which:
*In many regions of Brazil, {{IPA|/t/}} and {{IPA|/d/}} have the affricate allophones {{IPA|[tʃ]}} and {{IPA|[dʒ]}}, respectively, before {{IPA|/i/}} and {{IPA|/ĩ/}}. ([[Quebec French]] has a similar phenomenon, with alveolar affricates instead of postalveolars. [[Japanese language|Japanese]] is another example).
*At the end of a syllable, the phoneme {{IPA|/l/}} has the allophone {{IPA|[u̯]}} in Brazilian Portuguese (''[[L-vocalization#L-vocalization|L-vocalization]]'').
*In many parts of Brazil and Angola, intervocalic {{IPA|/ɲ/}} is pronounced as a [[nasalization|nasalized]] [[palatal approximant]] {{IPA|[j̃]}} which nasalizes the preceding vowel, so that for instance {{IPA|/ˈniɲu/}} is pronounced {{IPA|[ˈnĩj̃u]}}.
*In most of Brazil, the alveolar sibilants {{IPA|/s/}} and {{IPA|/z/}} occur in complementary distribution at the end of syllables, depending on whether the consonant that follows is voiceless or voiced, as in English. But in most of Portugal and parts of Brazil sibilants are postalveolar at the end of syllables, {{IPA|/ʃ/}} before voiceless consonants, and {{IPA|/ʒ/}} before voiced consonants (in [[Ladino language|Judeo-Spanish]], {{IPA|/s/}} is often replaced with {{IPA|/ʃ/}} at the end of syllables, too).
*There is considerable dialectal variation in the value of the [[Rhotic consonant|rhotic]] phoneme {{IPA|/ʁ/}}. See [[Guttural R#Portuguese|Guttural R in Portuguese]], for details.
==Grammar==
A particularly interesting aspect of the grammar of Portuguese is the verb. Morphologically, more verbal inflections from classical Latin have been preserved by Portuguese than by any other major Romance language. See [[Romance copula#Morphological comparison|Romance copula]], for a detailed comparison. It also has some innovations not found in other Romance languages (except Galician and the Fala):
* The [[present perfect tense]] has an iterative sense unique among the Romance languages. It denotes an action or a series of actions which began in the past and are expected to keep repeating in the future. For instance, the sentence ''Tenho tentado falar com ela'' would be translated to "I have been trying to talk to her", not "I have tried to talk to her". On the other hand, the correct translation of the question "Have you heard the latest news?" is not ''*Tem ouvido a última notícia?'', but ''Ouviu a última notícia?'', since no repetition is implied.
* The future [[Subjunctive mood|subjunctive]] tense, which was developed by medieval [[West Iberian languages|West Iberian Romance]], but has now fallen into disuse in Spanish, is still used in [[vernacular]] Portuguese. It appears in dependent clauses that denote a condition which must be fulfilled in the future, so that the independent clause will occur. Other languages normally employ the present tense under the same circumstances:
:''Se ''for'' eleito presidente, mudarei a lei.''
:If ''I am'' elected president, I will change the law.
:''Quando ''fores'' mais velho, vais entender.''
:When ''you are'' older, you will understand.
* The personal [[infinitive]]: infinitives can [[inflection|inflect]] according to their subject in [[Grammatical person|person]] and [[Grammatical number|number]], often showing who is expected to perform a certain action; cf. ''É melhor voltares'' "It is better [for you] to go back," ''É melhor voltarmos'' "It is better [for us] to go back." Perhaps for this reason, infinitive clauses replace subjunctive clauses more often in Portuguese than in other Romance languages.
==Writing system==
Portuguese is written with the [[Latin alphabet]], making use of five [[diacritic]]s to denote stress, vowel height, contraction, nasalization, and other sound changes (acute accent, grave accent, circumflex accent, tilde, and cedilla). [[Brazilian Portuguese]] also uses the diaeresis mark. Accented characters and digraphs are not counted as separate letters for [[collation]] purposes.
===Brazilian vs. European spelling===
There are some minor differences between the orthographies of Brazil and other Portuguese language countries. One of the most pervasive is the use of acute accents in the European/African/Asian orthography in many words such as ''sinónimo'', where the Brazilian orthography has a circumflex accent, ''sinônimo''. Another important difference is that Brazilian spelling often lacks ''c'' or ''p'' before ''c'', ''ç'', or ''t'', where the European orthography has them; for example, cf. Brazilian ''fato'' with European ''facto'', "fact", or Brazilian ''objeto'' with European ''objecto'', "object". Some of these spelling differences reflect differences in the pronunciation of the words, but others are merely graphic.
==Examples==
;Excerpt from the Portuguese [[national epic]] ''[[Os Lusíadas]]'', by author [[Luís de Camões]] (I, 33)
Predictive analytics
'''Predictive analytics''' encompasses a variety of techniques from [[statistics]] and [[data mining]] that analyze current and historical data to make predictions about future events. Such predictions rarely take the form of absolute statements, and are more likely to be expressed as values that correspond to the odds of a particular event or behavior taking place in the future.
In business, predictive models exploit patterns found in historical and transactional data to identify risks and opportunities. Models capture relationships among many factors to allow assessment of risk or potential associated with a particular set of conditions, guiding decision making for candidate transactions.
One of the most well-known applications is [[credit scoring]], which is used throughout [[financial services]]. Scoring models process a customer’s [[credit history]], [[loan application]], customer data, etc., in order to rank-order individuals by their likelihood of making future credit payments on time. Predictive analytics are also used in [[insurance]], [[telecommunications]], [[retail]], [[travel]], [[healthcare]], [[Pharmaceutical company|pharmaceuticals]] and other fields.
== Types of predictive analytics ==
Generally, predictive analytics is used to mean [[predictive modeling]], scoring of predictive models, and [[forecasting]]. However, people are increasingly using the term to describe related analytic disciplines, such as descriptive modeling and decision modeling or optimization. These disciplines also involve rigorous data analysis, and are widely used in business for segmentation and decision making, but have different purposes and the statistical techniques underlying them vary.
===Predictive models===
Predictive models analyze past performance to assess how likely a customer is to exhibit a specific behavior in the future in order to improve [[marketing effectiveness]]. This category also encompasses models that seek out subtle data patterns to answer questions about customer performance, such as fraud detection models. Predictive models often perform calculations during live transactions, for example, to evaluate the risk or opportunity of a given customer or transaction, in order to guide a decision.
===Descriptive models===
Descriptive models “describe” relationships in data in a way that is often used to classify customers or prospects into groups. Unlike predictive models that focus on predicting a single customer behavior (such as credit risk), descriptive models identify many different relationships between customers or products. But the descriptive models do not rank-order customers by their likelihood of taking a particular action the way predictive models do. Descriptive models are often used “offline,” for example, to categorize customers by their product preferences and life stage. Descriptive modeling tools can be utilized to develop agent-based models that can simulate a large number of individualized agents to predict possible futures.
===Decision models===
Decision models describe the relationship between all the elements of a decision — the known data (including results of predictive models), the decision and the forecast results of the decision — in order to predict the results of decisions involving many variables. These models can be used in optimization, a data-driven approach to improving decision logic that involves maximizing certain outcomes while minimizing others. Decision models are generally used offline, to develop decision logic or a set of business rules that will produce the desired action for every customer or circumstance.
== Predictive analytics ==
===Definition===
Predictive analytics is an area of statistical analysis that deals with extracting information from data and using it to predict future trends and behavior patterns. The core of predictive analytics relies on capturing relationships between explanatory variables and the predicted variables from past occurrences, and exploiting them to predict future outcomes.
===Current uses===
Although predictive analytics can be put to use in many applications, we outline a few examples where predictive analytics has shown positive impact in recent years.
====Analytical Customer Relationship Management (CRM)====
Analytical [[Customer Relationship Management]] is a frequent commercial application of Predictive Analysis. Methods of predictive analysis are applied to customer data to pursue CRM objectives.
====Direct marketing====
Product [[marketing]] is constantly faced with the challenge of coping with the increasing number of competing products, different consumer preferences and the variety of methods (channels) available to interact with each consumer. Efficient marketing is a process of understanding the amount of variability and tailoring the marketing strategy for greater profitability. Predictive analytics can help identify consumers with a higher likelihood of responding to a particular marketing offer. Models can be built using data from consumers’ past purchasing history and past response rates for each channel. Additional information about the consumers’ demographic, geographic and other characteristics can be used to make more accurate predictions. Targeting only these consumers can lead to a substantial increase in response rate, which can lead to a significant reduction in cost per acquisition. Apart from identifying prospects, predictive analytics can also help to identify the most effective combination of products and marketing channels that should be used to target a given consumer.
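The link between targeting and cost per acquisition can be illustrated with a small calculation. The following is only a sketch under stated assumptions: the function name <code>cost_per_acquisition</code>, the model scores, the response flags and the contact cost of 2.0 per consumer are all hypothetical and are not taken from any real campaign.

```python
def cost_per_acquisition(scores, responded, contact_cost, top_fraction):
    """Contact only the top-scored fraction of consumers.

    Returns (response_rate, cost_per_responder) for the contacted group.
    All inputs are illustrative; `scores` stands in for the output of
    some predictive model.
    """
    # Rank consumers from highest to lowest predicted likelihood of response.
    ranked = sorted(zip(scores, responded), key=lambda pair: pair[0], reverse=True)
    n_contact = max(1, round(len(ranked) * top_fraction))
    contacted = ranked[:n_contact]
    responders = sum(flag for _, flag in contacted)
    response_rate = responders / n_contact
    # Total spend divided by the number of consumers acquired.
    cpa = (n_contact * contact_cost) / responders if responders else float("inf")
    return response_rate, cpa


# Hypothetical scores and observed responses (1 = responded) for ten consumers.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
responded = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0]

everyone = cost_per_acquisition(scores, responded, contact_cost=2.0, top_fraction=1.0)
top_30 = cost_per_acquisition(scores, responded, contact_cost=2.0, top_fraction=0.3)
print("contact everyone:", everyone)
print("contact top 30%: ", top_30)
```

In this made-up data set, contacting only the top-scored 30% raises the response rate and lowers the cost per acquisition relative to contacting everyone, at the price of missing some responders lower in the ranking.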
====Cross-sell====
Often corporate organizations collect and maintain abundant data (e.g. customer records, sale transactions) and exploiting hidden relationships in the data can provide a competitive advantage to the organization. For an organization that offers multiple products, an analysis of existing customer behavior can lead to efficient [[cross-selling|cross sell]] of products. This directly leads to higher profitability per customer and strengthening of the customer relationship. Predictive analytics can help analyze customers’ spending, usage and other behavior, and help cross-sell the right product at the right time.
====Customer retention====
With the amount of competing services available, businesses need to focus efforts on maintaining continuous [[consumer satisfaction]]. In such a competitive scenario, [[consumer loyalty]] needs to be rewarded and [[customer attrition]] needs to be minimized. Businesses tend to respond to customer attrition on a reactive basis, acting only after the customer has initiated the process to terminate service. At this stage, changing the customer’s decision is almost impossible. Proper application of predictive analytics can lead to a more proactive retention strategy. By a frequent examination of a customer’s past service usage, service performance, spending and other behavior patterns, predictive models can determine the likelihood of a customer wanting to terminate service sometime in the near future. An intervention with lucrative offers can increase the chance of retaining the customer. Silent attrition, in which a customer slowly but steadily reduces usage, is another problem faced by many companies. Predictive analytics can also help predict this behavior before it occurs, so that the company can take proper actions to increase customer activity.
====Underwriting====
Many businesses have to account for risk exposure due to their different services and determine the cost needed to cover the risk. For example, auto insurance providers need to accurately determine the amount of premium to charge to cover each automobile and driver. A financial company needs to assess a borrower’s potential and ability to pay before granting a loan. For a health insurance provider, predictive analytics can analyze a few years of past medical claims data, as well as lab, pharmacy and other records where available, to predict how expensive an enrollee is likely to be in the future. Predictive analytics can help [[underwriting]] of these quantities by predicting the chances of illness, [[Default (finance)|default]], [[bankruptcy]], etc. Predictive analytics can streamline the process of customer acquisition, by predicting the future risk behavior of a customer using application level data. Proper predictive analytics can lead to proper pricing decisions, which can help mitigate future risk of default.
====Collection analytics====
Every portfolio has a set of delinquent customers who do not make their payments on time. The financial institution has to undertake collection activities on these customers to recover the amounts due. A lot of collection resources are wasted on customers who are difficult or impossible to recover. Predictive analytics can help optimize the allocation of collection resources by identifying the most effective collection agencies, contact strategies, legal actions and other strategies for each customer, thus significantly increasing recovery while reducing collection costs.
====Fraud detection====
Fraud is a big problem for many businesses and can be of various types. Inaccurate credit applications, fraudulent transactions, [[identity theft]]s and false insurance claims are some examples of this problem. These problems plague firms all across the spectrum and some examples of likely victims are [[Credit card fraud|credit card issuers]], insurance companies, retail merchants, manufacturers, business to business suppliers and even services providers. This is an area where a predictive model is often used to help weed out the “bads” and reduce a business's exposure to fraud.
====Portfolio, product or economy level prediction====
Often the focus of analysis is not the consumer but the product, portfolio, firm, industry or even the economy. For example, a retailer might be interested in predicting store-level demand for inventory management purposes, or the Federal Reserve Board might be interested in predicting the unemployment rate for the next year. These types of problems can be addressed by predictive analytics using time series techniques (see below).
==Statistical techniques==
The approaches and techniques used to conduct predictive analytics can broadly be grouped into regression techniques and machine learning techniques.
===Regression techniques===
Regression models are the mainstay of predictive analytics. The focus lies on establishing a mathematical equation as a model to represent the interactions between the different variables in consideration. Depending on the situation, there is a wide variety of models that can be applied while performing predictive analytics. Some of them are briefly discussed below.
====Linear regression model====
The linear regression model analyzes the relationship between the response or dependent variable and a set of independent or predictor variables. This relationship is expressed as an equation that predicts the response variable as a linear function of the parameters. These parameters are adjusted so that a measure of fit is optimized. Much of the effort in model fitting is focused on minimizing the size of the residual, as well as ensuring that it is randomly distributed with respect to the model predictions.
The goal of regression is to select the parameters of the model so as to minimize the sum of the squared residuals. This is referred to as '''[[ordinary least squares]]''' (OLS) estimation and results in best linear unbiased estimates (BLUE) of the parameters if the [[Gauss–Markov theorem|Gauss–Markov]] assumptions are satisfied.
Once the model has been estimated we would be interested to know if the predictor variables belong in the model – i.e. is the estimate of each variable’s contribution reliable? To do this we can check the statistical significance of the model’s coefficients which can be measured using the t-statistic. This amounts to testing whether the coefficient is significantly different from zero. How well the model predicts the dependent variable based on the value of the independent variables can be assessed by using the R² statistic. It measures predictive power of the model i.e. the proportion of the total variation in the dependent variable that is “explained” (accounted for) by variation in the independent variables.
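The quantities described above can be illustrated for the simplest case of a single predictor. The following is a minimal sketch rather than a reference implementation; the function name <code>ols_simple</code> and the data are hypothetical, chosen only for illustration.

```python
import math

def ols_simple(x, y):
    """Fit y = a + b*x by ordinary least squares and return basic diagnostics."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx                      # slope that minimizes the squared residuals
    a = mean_y - b * mean_x            # intercept
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    ss_res = sum(r ** 2 for r in residuals)          # unexplained variation
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)     # total variation
    r_squared = 1.0 - ss_res / ss_tot
    sigma2 = ss_res / (n - 2)          # residual variance (two estimated parameters)
    se_b = math.sqrt(sigma2 / sxx)     # standard error of the slope
    t_b = b / se_b                     # t-statistic for the hypothesis slope = 0
    return {"intercept": a, "slope": b, "r_squared": r_squared, "t_slope": t_b}

# Hypothetical data for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
fit = ols_simple(x, y)
print(fit)
```

A large absolute t-statistic suggests that the slope differs from zero, and an R² close to 1 indicates that most of the variation in the response is accounted for by the predictor.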
====Discrete choice models====
Multivariate regression (above) is generally used when the response variable is continuous and has an unbounded range. Often the response variable may not be continuous but rather discrete. While mathematically it is feasible to apply multivariate regression to discrete ordered dependent variables, some of the assumptions behind the theory of multivariate linear regression no longer hold, and there are other techniques such as discrete choice models which are better suited for this type of analysis. If the dependent variable is discrete, some of those superior methods are [[logistic regression]], [[multinomial logit]] and [[probit]] models. Logistic regression and probit models are used when the dependent variable is [[binary numeral system|binary]].
=====Logistic regression=====
In a classification setting, outcome probabilities can be assigned to observations with a logistic model, which transforms information about the binary dependent variable into an unbounded continuous variable (the log-odds) and estimates a regular multivariate model on that scale (see Allison’s ''Logistic Regression'' for more information on the theory of logistic regression).
The [[Wald test|Wald]] and [[likelihood-ratio test]]s are used to test the statistical significance of each coefficient b in the model (analogous to the t tests used in OLS regression; see above). A test assessing the goodness-of-fit of a classification model is the [[Hosmer and Lemeshow test]].
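As a rough illustration of how a logistic model assigns outcome probabilities, the sketch below fits a one-predictor model by plain gradient ascent on the log-likelihood. This fitting method, the function names, the learning rate and the data are all assumptions made for the example; statistical packages typically use more efficient procedures.

```python
import math

def sigmoid(z):
    """Map the unbounded log-odds z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, steps=5000):
    """Fit P(y=1|x) = sigmoid(b0 + b1*x) by gradient ascent on the mean log-likelihood."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradient of the mean log-likelihood with respect to b0 and b1
        g0 = sum(y - sigmoid(b0 + b1 * x) for x, y in zip(xs, ys)) / n
        g1 = sum((y - sigmoid(b0 + b1 * x)) * x for x, y in zip(xs, ys)) / n
        b0 += lr * g0
        b1 += lr * g1
    return b0, b1

# Hypothetical, deliberately overlapping data (not perfectly separable)
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
b0, b1 = fit_logistic(xs, ys)
print("P(y=1 | x=4.0) =", sigmoid(b0 + b1 * 4.0))
print("odds ratio per unit of x =", math.exp(b1))
```

The exponentiated coefficient is the odds ratio, that is, the multiplicative change in the odds of the outcome for a one-unit increase in the predictor.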
=====Multinomial logistic regression=====
An extension of the [[binary logit model]] to cases where the dependent variable has more than two categories is the [[multinomial logit model]]. In such cases collapsing the data into two categories might not make good sense or may lead to loss in the richness of the data. The multinomial logit model is the appropriate technique in these cases, especially when the dependent variable categories are not ordered (for example, colors like red, blue, green). Some authors have extended multinomial regression to include feature selection/importance methods such as [[Random multinomial logit]].
=====Probit regression=====
Probit models offer an alternative to logistic regression for modeling categorical dependent variables. Even though the outcomes tend to be similar, the underlying distributions are different. Probit models are popular in social sciences like economics.
A good way to understand the key difference between probit and logit models is to assume that there is a latent variable z.
We do not observe z but instead observe y, which takes the value 0 or 1. In the logit model we assume that the random error in z follows a logistic distribution. In the probit model we assume that it follows a standard normal distribution. Note that in social sciences (for example, economics), probit is often used to model situations where the observed variable y is continuous but takes values between 0 and 1.
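The latent-variable idea can be written out in the standard textbook form. The notation here (a single predictor x, coefficients β and error term ε) is an assumption introduced for illustration and does not come from the text above:

<math>z = \beta_0 + \beta_1 x + \varepsilon, \qquad y = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{otherwise} \end{cases}</math>

If the distribution of ε is symmetric about zero with cumulative distribution function F, then

<math>P(y = 1 \mid x) = P\big(\varepsilon > -(\beta_0 + \beta_1 x)\big) = F(\beta_0 + \beta_1 x)</math>

Taking <math>F(u) = 1/(1 + e^{-u})</math> (the logistic distribution) gives the logit model, and taking <math>F = \Phi</math> (the standard normal distribution function) gives the probit model.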
=====Logit vs. Probit=====
The probit model has been around longer than the logit model. The two behave almost identically, except that the logistic distribution has slightly heavier tails. In fact, one of the reasons the logit model was formulated was that the probit model was hard to compute because it involved evaluating integrals with no closed form. Modern computing, however, has made this computation fairly simple. The coefficients obtained from the two models differ mainly by a scale factor, and the fitted probabilities are fairly close. However, the odds ratio makes the logit model easier to interpret.
For practical purposes the only reasons for choosing the probit model over the logistic model would be:
* There is a strong belief that the underlying distribution is normal
* The actual event is not a binary outcome (e.g. Bankrupt/not bankrupt) but a proportion (e.g. Proportion of population at different debt levels).
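The similarity of the two link functions, and the heavier tails of the logistic distribution, can be checked numerically. This is a small sketch; the rescaling factor of 1.6 is a commonly quoted rule of thumb and is used here as an assumption, not as an exact constant.

```python
import math

def logistic_cdf(z):
    """Cumulative distribution function of the standard logistic distribution."""
    return 1.0 / (1.0 + math.exp(-z))

def normal_cdf(z):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Rule-of-thumb rescaling (assumption): logit coefficients are roughly 1.6 times probit ones
SCALE = 1.6

for z in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(z, round(normal_cdf(z), 4), round(logistic_cdf(SCALE * z), 4))
```

Near the center the two columns nearly coincide, while far from zero the logistic curve keeps more probability in the tails.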
==== Time series models====
[[Time series]] models are used for predicting or forecasting the future behavior of variables. These models account for the fact that data points taken over time may have an internal structure (such as autocorrelation, trend or seasonal variation) that should be accounted for. As a result standard regression techniques cannot be applied to time series data and methodology has been developed to decompose the trend, seasonal and cyclical component of the series. Modeling the dynamic path of a variable can improve forecasts since the predictable component of the series can be projected into the future.
Time series models estimate difference equations containing stochastic components. Two commonly used forms of these models are [[autoregressive model]]s (AR) and [[Moving average (technical analysis)|moving average]] (MA) models. The [[Box-Jenkins]] methodology (1976) developed by George Box and G.M. Jenkins combines the AR and MA models to produce the [[Autoregressive moving average model|ARMA]] (autoregressive moving average) model, which is the cornerstone of stationary time series analysis. ARIMA (autoregressive integrated moving average) models, on the other hand, are used to describe non-stationary time series. Box and Jenkins suggest differencing a non-stationary time series to obtain a stationary series to which an ARMA model can be applied. Non-stationary time series have a pronounced trend and do not have a constant long-run mean or variance.
Box and Jenkins proposed a three stage methodology which includes: model identification, estimation and validation. The identification stage involves identifying if the series is stationary or not and the presence of seasonality by examining plots of the series, autocorrelation and partial autocorrelation functions. In the estimation stage, models are estimated using non-linear time series or maximum likelihood estimation procedures. Finally the validation stage involves diagnostic checking such as plotting the residuals to detect outliers and evidence of model fit.
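The ideas of an autoregressive model and of differencing can be sketched with simulated data. This is an illustrative sketch only: the function names, the seed, the AR coefficient of 0.6 and the use of a simple least-squares estimate are assumptions made for the example, not part of the Box–Jenkins procedure as such.

```python
import random

def simulate_ar1(phi, n, rng):
    """Simulate x[t] = phi * x[t-1] + e[t] with standard normal noise e[t]."""
    x = [0.0]
    for _ in range(n - 1):
        x.append(phi * x[-1] + rng.gauss(0.0, 1.0))
    return x

def estimate_ar1(x):
    """Least-squares estimate of phi in x[t] = phi * x[t-1] + e[t] (no intercept)."""
    num = sum(x[t] * x[t - 1] for t in range(1, len(x)))
    den = sum(x[t - 1] ** 2 for t in range(1, len(x)))
    return num / den

def difference(x):
    """First difference of a series: x[t] - x[t-1]."""
    return [x[t] - x[t - 1] for t in range(1, len(x))]

rng = random.Random(0)                      # fixed seed so the run is repeatable
series = simulate_ar1(0.6, 2000, rng)       # stationary AR(1) series
phi_hat = estimate_ar1(series)
print("estimated AR coefficient:", phi_hat)

# A random walk is non-stationary; differencing it recovers the underlying noise.
steps = [rng.gauss(0.0, 1.0) for _ in range(2000)]
walk, level = [], 0.0
for s in steps:
    level += s
    walk.append(level)
diffed = difference(walk)
print("AR coefficient of differenced walk:", estimate_ar1(diffed))
```

The estimate for the stationary series should land near the simulated value, while the differenced random walk should show almost no remaining autocorrelation.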
In recent years time series models have become more sophisticated and attempt to model conditional heteroskedasticity with models such as ARCH ([[autoregressive conditional heteroskedasticity]]) and GARCH (generalized autoregressive conditional heteroskedasticity) models frequently used for financial time series. In addition time series models are also used to understand inter-relationships among economic variables represented by systems of equations using VAR (vector autoregression) and structural VAR models.
==== Survival or duration analysis====
[[Survival analysis]] is another name for time to event analysis. These techniques were primarily developed in the medical and biological sciences, but they are also widely used in the social sciences like economics, as well as in engineering (reliability and failure time analysis).
Censoring and non-normality which are characteristic of survival data generate difficulty when trying to analyze the data using conventional statistical models such as multiple linear regression. The Normal distribution, being a symmetric distribution, takes positive as well as negative values, but duration by its very nature cannot be negative and therefore normality cannot be assumed when dealing with duration/survival data. Hence the normality assumption of regression models is violated.
A censored observation is defined as an observation with incomplete information. Censoring introduces distortions into traditional statistical methods and is essentially a defect of the sample data. The assumption is that if the data were not censored it would be representative of the population of interest. In survival analysis, censored observations arise whenever the dependent variable of interest represents the time to a terminal event, and the duration of the study is limited in time.
An important concept in survival analysis is the hazard rate. The hazard rate is defined as the probability that the event will occur at time t conditional on surviving until time t. Another concept related to the hazard rate is the survival function which can be defined as the probability of surviving to time t.
Most models try to model the hazard rate by choosing the underlying distribution depending on the shape of the hazard function. A distribution whose hazard function slopes upward is said to have positive duration dependence, a decreasing hazard shows negative duration dependence whereas constant hazard is a process with no memory usually characterized by the exponential distribution. Some of the distributional choices in survival models are: F, gamma, Weibull, log normal, inverse normal, exponential etc. All these distributions are for a non-negative random variable.
Duration models can be parametric, non-parametric or semi-parametric. Some of the models commonly used are the Kaplan–Meier estimator (non-parametric) and the Cox proportional hazards model (semi-parametric).
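As an illustration of a non-parametric duration model, the following sketch computes the Kaplan–Meier estimate of the survival function in the presence of censoring. The function name and the data are hypothetical; censored observations leave the risk set without being counted as events.

```python
def kaplan_meier(durations, observed):
    """Return [(t, S(t))] at each distinct event time.

    durations: time to the event or to censoring for each subject
    observed:  True if the event occurred, False if the observation is censored
    """
    data = sorted(zip(durations, observed))
    at_risk = len(data)        # subjects still under observation
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        events = 0
        leaving = 0
        # Group all subjects whose duration equals t
        while i < len(data) and data[i][0] == t:
            events += 1 if data[i][1] else 0
            leaving += 1
            i += 1
        if events > 0:
            survival *= 1.0 - events / at_risk   # conditional survival past t
            curve.append((t, survival))
        at_risk -= leaving     # events and censored subjects both leave the risk set
    return curve

# Hypothetical data: durations in months, False marks a censored observation
durations = [2, 3, 3, 5, 7, 8]
observed = [True, True, False, True, False, True]
curve = kaplan_meier(durations, observed)
for t, s in curve:
    print(t, round(s, 4))
```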
==== Classification and regression trees====
Classification and regression trees (CART) is a [[non-parametric statistics|non-parametric]] technique that produces either classification or regression trees, depending on whether the dependent variable is categorical or numeric, respectively.
Trees are formed by a collection of rules based on values of certain variables in the modeling data set:
* Rules are selected based on how well splits based on variables’ values can differentiate observations based on the dependent variable
* Once a rule is selected and splits a node into two, the same logic is applied to each “child” node (i.e. it is a recursive procedure)
* Splitting stops when CART detects no further gain can be made, or some pre-set stopping rules are met
Each branch of the tree ends in a terminal node:
* Each observation falls into one and exactly one terminal node
* Each terminal node is uniquely defined by a set of rules
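The rule-selection step described above can be sketched for a single numeric variable. The Gini impurity used here is one common splitting criterion and is an assumption of this example, as are the function names and data; the text does not specify which criterion CART uses.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a collection of class labels (0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best rule 'value <= threshold'.

    The threshold is None when no split improves on the unsplit node,
    which corresponds to stopping because no further gain can be made.
    """
    n = len(values)
    best = (None, gini(labels))
    distinct = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2.0
        left = [lab for v, lab in zip(values, labels) if v <= threshold]
        right = [lab for v, lab in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical data: one predictor and a two-class dependent variable
values = [1, 2, 3, 10, 11, 12]
labels = ["a", "a", "a", "b", "b", "b"]
print(best_split(values, labels))
```

A full tree would apply the same search recursively to each child node and across all candidate variables.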
A very popular method for predictive analytics is Leo Breiman's [[Random forests]] or derived versions of this technique like [[Random multinomial logit]].
==== Multivariate adaptive regression splines====
[[Multivariate adaptive regression splines]] (MARS) is a [[Non-parametric statistics|non-parametric]] technique that builds flexible models by fitting [[piecewise linear regression]]s.
An important concept associated with regression splines is that of a knot. A knot is where one local regression model gives way to another and thus is the point of intersection between two splines.
In multivariate adaptive regression splines, [[basis function]]s are the tool used for generalizing the search for knots. Basis functions are a set of functions used to represent the information contained in one or more variables.
The MARS model almost always creates the basis functions in pairs.
The MARS approach deliberately overfits the model and then prunes it to arrive at the optimal model. The algorithm is computationally very intensive, and in practice an upper limit on the number of basis functions must be specified.
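The paired basis functions are usually hinge functions mirrored around a knot. The sketch below shows one such pair and a piecewise linear model built from it; the knot location and the coefficients are hypothetical values chosen for illustration, not the result of a fitted model.

```python
def hinge_pair(knot):
    """Return the mirrored pair of hinge basis functions for a given knot."""
    def right(x):
        return max(0.0, x - knot)   # active only to the right of the knot
    def left(x):
        return max(0.0, knot - x)   # active only to the left of the knot
    return right, left

right, left = hinge_pair(3.0)

def model(x):
    """Piecewise linear model: slope +1 left of the knot, slope +2 right of it."""
    # Hypothetical coefficients; a real MARS fit would estimate them from data
    return 5.0 + 2.0 * right(x) - 1.0 * left(x)

for x in (1.0, 3.0, 5.0):
    print(x, model(x))
```

Because each hinge function is zero on one side of the knot, the two local linear models meet at the knot without a jump.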
=== Machine learning techniques===
[[Machine learning]], a branch of artificial intelligence, was originally employed to develop techniques to enable computers to learn. Today, since it includes a number of advanced statistical methods for regression and classification, it finds application in a wide variety of fields including [[medical diagnostics]], [[credit card fraud detection]], [[Face recognition|face]] and [[speech recognition]] and analysis of the [[stock market]]. In certain applications it is sufficient to directly predict the dependent variable without focusing on the underlying relationships between variables. In other cases, the underlying relationships can be very complex and the mathematical form of the dependencies unknown. For such cases, machine learning techniques emulate [[human cognition]] and learn from training examples to predict future events.
A brief discussion of some of these methods used commonly for predictive analytics is provided below. A detailed study of machine learning can be found in Mitchell (1997).
==== Neural networks====
[[Neural networks]] are [[Nonlinearity|nonlinear]] sophisticated modeling techniques that are able to [[Model (abstract)|model]] complex functions. They can be applied to problems of [[Time series|prediction]], [[Statistical classification|classification]] or [[Control theory|control]] in a wide spectrum of fields such as [[finance]], [[cognitive psychology]]/[[cognitive neuroscience|neuroscience]], [[medicine]], [[engineering]], and [[physics]].
Neural networks are used when the exact nature of the relationship between inputs and output is not known. A key feature of neural networks is that they learn the relationship between inputs and output through training. There are two types of training in neural networks used by different networks, [[Supervised learning|supervised]] and [[Unsupervised learning|unsupervised]] training, with supervised being the most common one.
Some examples of neural network training techniques are [[backpropagation]], quick propagation, [[Conjugate gradient method|conjugate gradient descent]], [[Radial basis function|projection operator]], Delta-Bar-Delta etc. These are applied to network architectures such as multilayer [[perceptron]]s, [[Self-organizing map|Kohonen network]]s, [[Hopfield network]]s, etc.
====Radial basis functions====
A [[radial basis function]] (RBF) is a function which has built into it a distance criterion with respect to a center. Such functions can be used very efficiently for interpolation and for smoothing of data. Radial basis functions have been applied in the area of [[neural network]]s where they are used as a replacement for the sigmoidal transfer function. Such networks have 3 layers, the input layer, the hidden layer with the RBF non-linearity and a linear output layer. The most popular choice for the non-linearity is the Gaussian. RBF networks have the advantage of not being locked into local minima as do the [[feed-forward]] networks such as the multilayer perceptron.
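The three-layer structure described above can be sketched as a forward pass through a small RBF network with Gaussian hidden units. The centers, width, weights and bias below are hypothetical values; in practice they would be learned from data.

```python
import math

def gaussian_rbf(x, center, width):
    """Gaussian radial basis function: depends only on the distance from x to the center."""
    dist2 = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    return math.exp(-dist2 / (2.0 * width ** 2))

def rbf_network(x, centers, width, weights, bias):
    """Input layer -> Gaussian hidden layer -> linear output layer."""
    hidden = [gaussian_rbf(x, c, width) for c in centers]
    return bias + sum(w * h for w, h in zip(weights, hidden))

# Hypothetical network with two hidden units
centers = [(0.0, 0.0), (4.0, 4.0)]
weights = [1.0, -1.0]
print(rbf_network((0.0, 0.0), centers, 1.0, weights, 0.0))   # close to +1
print(rbf_network((4.0, 4.0), centers, 1.0, weights, 0.0))   # close to -1
print(rbf_network((2.0, 2.0), centers, 1.0, weights, 0.0))   # halfway between the centers
```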
==== Support vector machines====
[[Support Vector Machine]]s (SVM) are used to detect and exploit complex patterns in data by clustering, classifying and ranking the data. They are learning machines that are used to perform binary classifications and regression estimations. They commonly use kernel based methods to apply linear classification techniques to non-linear classification problems. There are a number of types of SVM such as linear, polynomial, sigmoid etc.
==== Naïve Bayes====
[[Naive Bayes classifier|Naïve Bayes]], based on Bayes' conditional probability rule, is used for performing classification tasks. Naïve Bayes assumes the predictors are statistically independent, which makes it an effective classification tool that is easy to interpret. It is best employed when faced with the problem of the ‘curse of dimensionality’, i.e. when the number of predictors is very high.
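A minimal sketch of a Naïve Bayes classifier for categorical predictors follows. The function names, the use of add-one (Laplace) smoothing and the data are assumptions made for this example.

```python
import math
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Count class frequencies and per-class value frequencies for each predictor."""
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)   # (class, predictor index) -> value counts
    values = defaultdict(set)               # predictor index -> distinct values seen
    for row, label in zip(rows, labels):
        for j, v in enumerate(row):
            feature_counts[(label, j)][v] += 1
            values[j].add(v)
    return class_counts, feature_counts, values

def predict_nb(model, row):
    """Return the class with the highest posterior score under the independence assumption."""
    class_counts, feature_counts, values = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)                    # log prior
        for j, v in enumerate(row):
            num = feature_counts[(label, j)][v] + 1        # add-one smoothing
            den = count + len(values[j])
            score += math.log(num / den)                   # log conditional probability
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training data: (weather, temperature) -> outcome
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "cool")]
labels = ["no", "no", "yes", "yes"]
model = train_nb(rows, labels)
print(predict_nb(model, ("rainy", "cool")))
```

Because the predictors are treated as independent, the score is a simple sum of per-predictor terms, which is what keeps the method tractable when the number of predictors is large.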
==== k-nearest neighbours====
The [[K-nearest neighbor algorithm|nearest neighbour algorithm]] (KNN) belongs to the class of pattern recognition statistical methods. The method does not impose a priori any assumptions about the distribution from which the modeling sample is drawn. It involves a training set with both positive and negative values. A new sample is classified by calculating the distance to the nearest neighbouring training case. The sign of that point will determine the classification of the sample. In the k-nearest neighbour classifier, the k nearest points are considered and the sign of the majority is used to classify the sample. The performance of the kNN algorithm is influenced by three main factors: (1) the distance measure used to locate the nearest neighbours; (2) the decision rule used to derive a classification from the k-nearest neighbours; and (3) the number of neighbours used to classify the new sample. It can be proved that, unlike other methods, this method is universally asymptotically convergent, i.e., as the size of the training set increases, if the observations are iid, regardless of the distribution from which the sample is drawn, the predicted class will converge to the class assignment that minimizes misclassification error. See Devroye et al.
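The classification rule can be sketched in a few lines. Euclidean distance and a simple majority vote are assumed here as the distance measure and decision rule, and the training points are hypothetical.

```python
import math
from collections import Counter

def knn_classify(train_points, train_labels, query, k=3):
    """Classify query by majority vote among its k nearest training points."""
    # Sort training cases by Euclidean distance to the query
    distances = sorted(
        (math.dist(p, query), label) for p, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training set with negative (-1) and positive (+1) cases
train_points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_labels = [-1, -1, -1, 1, 1, 1]

print(knn_classify(train_points, train_labels, (0.5, 0.5)))
print(knn_classify(train_points, train_labels, (5.5, 5.5)))
```

Choosing an odd k for a two-class problem avoids tied votes.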
==Popular tools==
There are numerous tools available in the marketplace which help with the execution of predictive analytics. These range from those which need very little user sophistication to those that are designed for the expert practitioner. The difference between these tools is often in the level of customization and heavy data lifting allowed. For traditional statistical modeling some of the popular tools are [[DAP (software)|DAP]]/[[SAS Institute|SAS]], S-Plus, [[PSPP]]/[[SPSS]] and Stata. For machine learning/data mining type of applications, KnowledgeSEEKER, KnowledgeSTUDIO, Enterprise Miner, GeneXproTools, [[Viscovery]], Clementine, [[KXEN Inc.|KXEN Analytic Framework]], [[InforSense]] and Excel Miner are some of the popularly used options. Classification Tree analysis can be performed using CART software. SOMine is a predictive analytics tool based on [[self-organizing map]]s (SOMs) available from [[Viscovery Software]]. [[R (programming_language)|R]] is a very powerful tool that can be used to perform almost any kind of statistical analysis, and is freely downloadable. [[WEKA]] is a freely available [[open source|open-source]] collection of [[machine learning]] methods for pattern classification, regression, clustering, and some types of meta-learning, which can be used for predictive analytics. [[RapidMiner]] is another freely available integrated [[open source|open-source]] software environment for predictive analytics, [[data mining]], and [[machine learning]] fully integrating WEKA and providing an even larger number of methods for predictive analytics.
Recently, in an attempt to provide a standard language for expressing predictive models, the [[Predictive Model Markup Language]] (PMML) has been proposed. Such an XML-based language provides a way for the different tools to define predictive models and to share these between PMML-compliant applications. Several tools already produce or consume PMML documents; these include [[ADAPA]], [[IBM DB2]] Warehouse, CART, SAS Enterprise Miner, and [[SPSS]].
Predictive analytics has also found its way into the IT lexicon, most notably in the area of IT automation. Vendors such as [[Stratavia]], with its [[Data Palette]] product, offer predictive analytics as part of their automation platform, predicting how resources will behave in the future and automating the environment accordingly.
The widespread use of predictive analytics in industry has led to the proliferation of numerous productized solutions firms. Some of them are highly specialized (focusing, for example, on fraud detection, automatic sales lead generation or response modeling) in a specific domain ([[Fair Isaac]] for credit card scores) or industry vertical (MarketRx in pharmaceuticals). Others provide predictive analytics services in support of a wide range of business problems across industry verticals ([[Fifth C]]). Predictive analytics competitions are also fairly common and often pit academics and industry practitioners against one another (see, for example, KDD CUP).
==Conclusion==
Predictive analytics adds great value to a business's decision-making capabilities by allowing it to formulate smart policies on the basis of predictions of future outcomes. A broad range of tools and techniques is available for this type of analysis, and their selection is determined by the analytical maturity of the firm as well as the specific requirements of the problem being solved.
==Education==
Predictive analytics is taught at the following institutions:
* Ghent University, Belgium: [http://www.mma.UGent.be Master of Marketing Analysis], an 8-month advanced master degree taught in English with strong emphasis on applications of predictive analytics in Analytical CRM.
RapidMiner
'''RapidMiner''' (formerly YALE (Yet Another Learning Environment)) is an environment for [[machine learning]] and [[data mining]] experiments. It allows experiments to be made up of a large number of arbitrarily nestable operators, described in [[XML]] files which can easily be created with RapidMiner's [[graphical user interface]]. Applications of RapidMiner cover both research and real-world data mining tasks.
The initial version has been developed by the Artificial Intelligence Unit of [[Dortmund University of Technology|University of Dortmund]] since [[2001]]. It is distributed under a [[GNU]] license, and has been hosted by [[SourceForge]] since [[2004]].
RapidMiner provides more than 400 operators for all main machine learning procedures, including input and output, and data preprocessing and visualization. It is written in the [[Java (programming language)|Java programming language]] and therefore can work on all popular operating systems. It also integrates all learning schemes and attribute evaluators of the [[Weka (machine learning)|Weka]] learning environment.
== Properties ==
Some properties of RapidMiner are:
* written in Java
* [[knowledge discovery]] processes are modeled as operator trees
* internal XML representation ensures standardized interchange format of data mining experiments
* scripting language allows for automatic large-scale experiments
* multi-layered data view concept ensures efficient and transparent data handling
* [[graphical user interface]], [[command line]] mode ([[Batch file|batch mode]]), and [[Java API]] for using RapidMiner from your own programs
* [[plugin]] and [[Extension (computing)|extension]] mechanisms, several plugins already exist
* [[plotting]] facility offering a large set of high-dimensional visualization schemes for data and models
* applications include [[text mining]], multimedia mining, feature engineering, data stream mining and tracking drifting concepts, development of ensemble methods, and distributed data mining.
Russian language
'''Russian''' ({{IPA-ru|ˈruskʲɪj jɪˈzɨk}}) is the most geographically widespread language of [[Eurasia]], the most widely spoken of the [[Slavic languages]], and the largest [[native language]] in [[Europe]]. Russian belongs to the family of [[Indo-European languages]] and is one of three (or, according to some authorities, four) living members of the [[East Slavic languages]], the others being [[Belarusian language|Belarusian]] and [[Ukrainian language|Ukrainian]] (and possibly [[Rusyn language|Rusyn]], often considered a dialect of Ukrainian). It is also spoken in the countries of the [[Russophone]] world.
Written examples of Old East Slavonic are attested from the 10th century onwards. Today Russian is widely used outside [[Russia]]. It is applied as a means of coding and storage of universal knowledge: 60–70% of all world information is published in the English and Russian languages. Over a quarter of the world's scientific literature is published in Russian. Russian is also a necessary accessory of world communications systems (broadcasts, air and space communication, etc.). Due to the status of the [[Soviet Union]] as a [[superpower]], Russian had great political importance in the 20th century. Hence, the language is one of the [[United Nations#Languages|official languages]] of the [[United Nations]].
Russian distinguishes between [[consonant]] [[phoneme]]s with [[palatalization|palatal]] [[secondary articulation]] and those without, the so-called ''soft'' and ''hard'' sounds. This distinction is found between pairs of almost all consonants and is one of the most distinguishing features of the language. Another important aspect is the [[vowel reduction|reduction]] of [[stress (linguistics)|unstressed]] [[vowel]]s, which is somewhat similar to [[Unstressed and reduced vowels in English|that of English]]. Stress, which is unpredictable, is not normally indicated orthographically. According to the Institute of Russian Language of the Russian Academy of Sciences, an optional [[acute accent]] may, and sometimes should, be used to mark stress. For example, it is used to distinguish between otherwise identical words, especially when context doesn't make it obvious: ''замо́к/за́мок'' (lock/castle), ''сто́ящий/стоя́щий'' (worthwhile/standing), ''чудно́/чу́дно'' (this is odd/this is marvellous), ''молоде́ц/мо́лодец'' (attaboy/fine young man), ''узна́ю/узнаю́'' (I shall learn it/I am learning it), ''отреза́ть/отре́зать'' (infinitive for "cut"/perfective for "cut"); to indicate the proper pronunciation of uncommon words, especially personal and family names (''афе́ра, гу́ру, Гарси́а, Оле́ша, Фе́рми''), and to express the stressed word in the sentence (''Ты́ съел печенье?/Ты съе́л печенье?/Ты съел пече́нье?'' - Was it you who ate the cookie?/Did you eat the cookie?/Was it the cookie that you ate?). Acute accents are mandatory in lexical dictionaries and books intended to be used either by children or foreign readers.
==Classification==
Russian is a [[Slavic languages|Slavic language]] in the [[Indo-European Languages|Indo-European family]]. From the point of view of the [[spoken language]], its closest relatives are [[Ukrainian language|Ukrainian]] and [[Belarusian language|Belarusian]], the other two national languages in the [[East Slavic languages|East Slavic]] group. In many places in eastern [[Ukraine]] and [[Belarus]], these languages are spoken interchangeably, and in certain areas traditional bilingualism resulted in language mixture, e.g. [[Surzhyk]] in eastern Ukraine and [[Trasianka]] in Belarus. An East Slavic [[Old Novgorod dialect]], although vanished during the fifteenth or sixteenth century, is sometimes considered to have played a significant role in formation of the modern Russian language.
The vocabulary (mainly abstract and literary words), principles of word formation, and, to some extent, inflections and literary style of Russian have been also influenced by [[Church Slavonic language|Church Slavonic]], a developed and partly adopted form of the [[South Slavic languages|South Slavic]] [[Old Church Slavonic]] language used by the [[Russian Orthodox Church]]. However, the East Slavic forms have tended to be used exclusively in the various dialects that are experiencing a rapid decline. In some cases, both the [[East Slavic languages|East Slavic]] and the [[Church Slavonic]] forms are in use, with slightly different meanings. ''For details, see [[Russian phonology]] and [[History of the Russian language]].''
Russian phonology and syntax (especially in northern dialects) have also been influenced to some extent by the numerous Finnic languages of the [[Finno-Ugric languages|Finno-Ugric subfamily]]: [[Merya language|Merya]], [[Moksha language|Moksha]], [[Muromian language|Muromian]], the language of the [[Meshchera]], [[Veps language|Veps]], et cetera. These languages, some of them now extinct, used to be spoken in the center and in the north of what is now the European part of Russia. They came in contact with Eastern Slavic as far back as the early Middle Ages and eventually served as substratum for the modern Russian language. The Russian dialects spoken north, north-east and north-west of [[Moscow]] have a considerable number of words of Finno-Ugric origin. Over the course of centuries, the vocabulary and literary style of Russian have also been influenced by Turkic/Caucasian/Central Asian languages, as well as Western/Central European languages such as [[Polish language|Polish]], [[Latin]], [[Dutch language|Dutch]], [[German language|German]], [[French language|French]], and [[English language|English]].
According to the [[Defense Language Institute]] in [[Monterey, California]], Russian is classified as a level III language in terms of learning difficulty for native English speakers, requiring approximately 780 hours of immersion instruction to achieve intermediate fluency. It is also regarded by the [[United States Intelligence Community]] as a "hard target" language, due both to its difficulty for English speakers to master and to its critical role in American world policy.
==Geographic distribution==
Russian is primarily spoken in [[Russia]] and, to a lesser extent, the other countries that were once constituent republics of the [[Soviet Union|USSR]]. Until [[1917]], it was the sole official language of the [[Russian Empire]]. During the Soviet period, the policy toward the languages of the various other ethnic groups fluctuated in practice. Though each of the constituent republics had its own official language, the unifying role and superior status was reserved for Russian. Following the break-up of [[1991]], several of the newly independent states have encouraged their native languages, which has partly reversed the privileged status of Russian, though its role as the language of post-Soviet national intercourse throughout the region has continued.
In [[Latvia]], notably, its official recognition and legality in the classroom have been a topic of considerable debate in a country where more than one-third of the population is Russian-speaking, consisting mostly of post-[[World War II]] immigrants from Russia and other parts of the former [[USSR]] (Belarus, Ukraine). Similarly, in [[Estonia]], the Soviet-era immigrants and their Russian-speaking descendants constitute 25.6% of the country's current population and 58.6% of the native Estonian population is also able to speak Russian. In all, 67.8% of Estonia's population can speak Russian.
In [[Kazakhstan]] and [[Kyrgyzstan]], Russian remains a co-official language with [[Kazakh language|Kazakh]] and [[Kyrgyz language|Kyrgyz]] respectively. Large Russian-speaking communities still exist in northern Kazakhstan, and ethnic Russians comprise 25.6% of Kazakhstan's population.
A much smaller Russian-speaking minority in [[Lithuania]] represents less than one-tenth of the country's overall population. Nevertheless, more than half of the population of the [[Baltic states]] is able to hold a conversation in Russian, and almost all have at least some familiarity with the most basic spoken and written phrases. The Russian control of [[Finland]] in 1809–1918, however, has left few Russian speakers in Finland. There are 33,400 Russian speakers in Finland, amounting to 0.6% of the population. Of them, 5,000 (0.1%) are late 19th-century and 20th-century immigrants, and the rest are recent immigrants who arrived in the 1990s and later.
In the twentieth century, Russian was widely taught in the schools of the members of the old [[Warsaw Pact]] and in other [[Communist state|countries]] that used to be allies of the USSR. In particular, these countries include [[Poland]], [[Bulgaria]], the [[Czech Republic]], [[Slovakia]], [[Hungary]], [[Romania]], [[Albania]] and [[Cuba]]. However, younger generations are usually not fluent in it, because Russian is no longer mandatory in the school system. It is currently the most widely taught foreign language in [[Mongolia]].
Russian is also spoken in [[Israel]] by at least 750,000 ethnic [[Jew]]ish immigrants from the former [[Soviet Union]] (1999 census). The Israeli [[Mass media|press]] and [[website]]s regularly publish material in Russian.
Sizable Russian-speaking communities also exist in [[North America]], especially in large urban centers of the [[United States|U.S.]] and [[Canada]] such as [[New York City]], [[Philadelphia]], [[Boston, Massachusetts|Boston]], [[Los Angeles, California|Los Angeles]], [[San Francisco]], [[Seattle]], [[Toronto]], [[Baltimore]], [[Miami, Florida|Miami]], [[Chicago]], [[Denver]], and the [[Cleveland, Ohio|Cleveland]] suburb of [[Richmond Heights, Ohio|Richmond Heights]]. In the former two, Russian-speaking groups total over half a million. In a number of locations they issue their own newspapers, and live in their self-sufficient neighborhoods (especially the generation of immigrants who started arriving in the early sixties). Only about a quarter of them are ethnic Russians, however. Before the [[dissolution of the Soviet Union]], the overwhelming majority of [[Russophone]]s in North America were Russian-speaking [[Jews]]. Afterwards the influx from the countries of the former [[Soviet Union]] changed the statistics somewhat. According to the [[United States 2000 Census]], Russian is the primary language spoken in the homes of over 700,000 individuals living in the United States.
Significant Russian-speaking groups also exist in [[Western Europe]]. These have been fed by several waves of immigrants since the beginning of the twentieth century, each with its own flavor of language. [[Germany]], the [[United Kingdom]], [[Spain]], [[France]], [[Italy]], [[Belgium]], [[Greece]], [[Brazil]], [[Norway]], [[Austria]], and [[Turkey]] have significant Russian-speaking communities totaling 3 million people.
Two thirds of them are actually Russian-speaking descendants of [[German people|Germans]], [[Greeks]], [[Jews]], [[Armenians]], or [[Ukrainians]] who either repatriated after the [[USSR]] collapsed or are just looking for temporary employment.
Recent estimates of the total number of speakers of Russian:
===Official status===
Russian is the official language of [[Russia]]. It is also an official language of [[Belarus]], [[Kazakhstan]] and [[Kyrgyzstan]], an unofficial but widely spoken language in [[Ukraine]], and the de facto official language of the [[List of unrecognized countries|unrecognized]] states of [[Transnistria]], [[South Ossetia]] and [[Abkhazia]]. Russian is one of the [[United Nations#Languages|six official languages]] of the [[United Nations]]. Education in Russian is still a popular choice for both Russian as a second language (RSL) and native speakers in Russia as well as many of the former Soviet republics.
97% of the public school students of Russia, 75% in Belarus, 41% in Kazakhstan, 25% in [[Ukraine]], 23% in Kyrgyzstan, 21% in [[Moldova]], 7% in [[Azerbaijan]], 5% in [[Georgia (country)|Georgia]] and 2% in [[Armenia]] and [[Tajikistan]] receive their education only or mostly in Russian. By comparison, the corresponding percentage of ethnic Russians is 78% in [[Russia]], 10% in [[Belarus]], 26% in [[Kazakhstan]], 17% in [[Ukraine]], 9% in [[Kyrgyzstan]], 6% in [[Republic of Moldova|Moldova]], 2% in [[Azerbaijan]], 1.5% in [[Georgia (country)|Georgia]] and less than 1% in both [[Armenia]] and [[Tajikistan]].
Russian-language schooling is also available in Latvia, Estonia and Lithuania, but due to education reforms, the number of subjects taught in Russian is reduced at the high school level. The language has a co-official status alongside [[Moldovan language|Moldovan]] in the autonomies of [[Gagauzia]] and [[Transnistria]] in [[Moldova]], and in seven [[Romania]]n [[Commune in Romania|communes]] in [[Tulcea County|Tulcea]] and [[Constanţa County|Constanţa]] counties. In these localities, Russian-speaking [[Lipovans]], who are a recognized ethnic minority, make up more than 20% of the population. Thus, according to Romania's minority rights law, education, signage, and access to public administration and the justice system are provided in Russian alongside Romanian. In the [[Crimea|Autonomous Republic of Crimea]] in Ukraine, Russian is an officially recognized language alongside [[Crimean Tatar language|Crimean Tatar]], but in reality it is the only language used by the government, making it a ''[[de facto]]'' official language.
===Dialects===
Despite leveling after 1900, especially in matters of vocabulary, a number of dialects exist in Russia. Some linguists divide the dialects of the Russian language into two primary regional groupings, "Northern" and "Southern", with [[Moscow]] lying in the zone of transition between the two. Others divide the language into three groupings, Northern, Central and Southern, with Moscow lying in the Central region. [[Dialectology]] within Russia recognizes dozens of smaller-scale variants.
The dialects often show distinct and non-standard features of pronunciation and intonation, vocabulary, and grammar. Some of these are relics of ancient usage now completely discarded by the standard language.
The [[northern Russian dialects]] and those spoken along the [[Volga River]] typically pronounce unstressed {{IPA|/o/}} clearly (the phenomenon called [[vowel reduction in Russian#Back vowels|okanye]]/оканье). East of Moscow, particularly in [[Ryazan Region]], unstressed {{IPA|/e/}} and {{IPA|/a/}} following [[palatalization|palatalized]] consonants and preceding a stressed syllable are not reduced to {{IPA|[ɪ]}} (as in the Moscow dialect), being instead pronounced as {{IPA|/a/}} in such positions (e.g. несл'''и''' is pronounced as {{IPA|[nʲasˈlʲi]}}, not as {{IPA|[nʲɪsˈlʲi]}}); this is called [[yakanye]]/яканье. Many southern dialects have a palatalized final {{IPA|/tʲ/}} in 3rd person forms of verbs (this is unpalatalized in the standard dialect) and a fricative {{IPA|[ɣ]}} where the standard dialect has {{IPA|[g]}}. However, in certain areas south of Moscow, e.g. in and around [[Tula, Russia|Tula]], {{IPA|/g/}} is pronounced as in the Moscow and northern dialects unless it precedes a voiceless plosive or a pause. In this position {{IPA|/g/}} is lenited and devoiced to the fricative {{IPA|[x]}}, e.g. друг {{IPA|[drux]}} (in Moscow's dialect, only Бог {{IPA|[box]}}, лёгкий {{IPA|[lʲɵxʲkʲɪj]}}, мягкий {{IPA|[ˈmʲæxʲkʲɪj]}} and some derivatives follow this rule). Some of these features (e.g. a [[debuccalization|debuccalized]] or [[lenition|lenited]] {{IPA|/g/}} and palatalized final {{IPA|/tʲ/}} in 3rd person forms of verbs) are also present in modern [[Ukrainian language|Ukrainian]], indicating either a linguistic continuum or strong influence one way or the other.
The city of [[Veliky Novgorod]] has historically displayed a feature called chokanye/tsokanye (чоканье/цоканье), where {{IPA|/ʨ/}} and {{IPA|/ʦ/}} were confused (this is thought to be due to influence from [[Finnish language|Finnish]], which doesn't distinguish these sounds). So, '''ц'''апля ("heron") has been recorded as 'чапля'. Also, the second palatalization of [[Velar consonant|velar]]s did not occur there, so the so-called '''ě²''' (from the Proto-Slavonic diphthong *ai) did not cause {{IPA|/k, g, x/}} to shift to {{IPA|/ʦ, ʣ, s/}}; therefore where [[Standard Russian]] has '''ц'''епь ("chain"), the form '''к'''епь {{IPA|[kʲepʲ]}} is attested in earlier texts.
Among the first to study Russian dialects was [[Mikhail Lomonosov|Lomonosov]] in the eighteenth century. In the nineteenth, [[Vladimir Dal]] compiled the first dictionary that included dialectal vocabulary. Detailed mapping of Russian dialects began at the turn of the twentieth century. In modern times, the monumental ''Dialectological Atlas of the Russian Language'' (''Диалектологический атлас русского языка'' {{IPA|[dʲɪɐˌlʲɛktəlɐˈgʲiʨɪskʲɪj ˈatləs ˈruskəvə jɪzɨˈka]}}) was published in three folio volumes in 1986–1989, after four decades of preparatory work.
The ''standard language'' is based on (but not identical to) the Moscow dialect.
===Derived languages===
* [[Balachka]], a dialect spoken primarily by [[Cossacks]] in the regions of Don, [[Kuban]] and [[Terek]].
* [[Fenya]], a criminal [[argot]] of ancient origin, with Russian grammar but a distinct vocabulary.
* [[Nadsat]], the fictional language spoken in ''[[A Clockwork Orange]]'', uses many Russian words and Russian slang.
* [[Surzhyk]] is a language with Russian and Ukrainian features, spoken in some areas of Ukraine.
* [[Trasianka]] is a language with Russian and Belarusian features used by a large portion of the rural population in [[Belarus]].
* [[Quelia]], a pseudo pidgin of German and Russian.
* [[Runglish]], Russian-English pidgin. This word is also used by English speakers to describe the way in which Russians attempt to speak English using Russian morphology and/or syntax.
* [[Russenorsk language|Russenorsk]] is an extinct [[pidgin]] language with mostly Russian vocabulary and mostly [[Norwegian language|Norwegian]] grammar, used for communication between [[Russians]] and [[Norway|Norwegian]] traders in the Pomor trade in [[Finnmark]] and the [[Kola Peninsula]].
==Writing system==
===Alphabet===
Russian is written using a modified version of the [[Cyrillic alphabet|Cyrillic (кириллица)]] alphabet. The Russian alphabet consists of 33 letters. The following table gives their upper case forms, along with [[help:IPA|IPA]] values for each letter's typical sound:
Older letters of the Russian alphabet include <ѣ>, which merged into <е> ({{IPA|/e/}}); <і> and <ѵ>, which both merged into <и> ({{IPA|/i/}}); <ѳ>, which merged into <ф> ({{IPA|/f/}}); and <ѧ>, which merged into <я> ({{IPA|/ja/}} or {{IPA|/ʲa/}}). While these older letters have been abandoned at one time or another, they may be used in this and related articles. The [[yer]]s <ъ> and <ь> originally indicated the pronunciation of ''ultra-short'' or ''reduced'' {{IPA|/ŭ/}}, {{IPA|/ĭ/}}.
The Russian alphabet has many systems of [[character encoding]]. [[KOI8-R]] was designed by the government and was intended to serve as the standard encoding. This encoding is still used in UNIX-like operating systems. Nevertheless, the spread of [[MS-DOS]] and [[Microsoft Windows]] created chaos and ended up establishing different encodings as de facto standards. For communication purposes, a number of conversion applications were developed. "[[iconv]]" is an example that is supported by most versions of [[Linux]], [[Macintosh]] and some other [[operating system]]s.
Most implementations (especially old ones) of the character encoding for the Russian language are aimed at simultaneous use of English and Russian characters only and do not include support for any other language. Certain hopes for a unification of the character encoding for the Russian alphabet are related to the [[Unicode|Unicode standard]], specifically designed for peaceful coexistence of various languages, including even [[dead language]]s. [[Unicode]] also supports the letters of the [[Early Cyrillic alphabet]], which have many similarities with the [[Greek alphabet]].
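The conversion problem described above can be illustrated with a minimal Python sketch. It assumes that the de facto MS-DOS and Windows encodings meant here are code pages 866 and 1251 (the text does not name them), and the helper <code>convert</code> is a hypothetical stand-in for what a tool such as iconv does, not its actual implementation.

```python
# Illustrative sketch: the same Russian text in several legacy encodings.
# Assumption: "cp866" (MS-DOS) and "cp1251" (Windows) stand for the
# de facto encodings mentioned in the text; "koi8_r" is KOI8-R.

text = "Русский язык"

koi8_bytes = text.encode("koi8_r")
cp1251_bytes = text.encode("cp1251")
cp866_bytes = text.encode("cp866")

# Each legacy encoding uses one byte per character, but different byte values.
same_length = len(koi8_bytes) == len(cp1251_bytes) == len(cp866_bytes) == len(text)
different_bytes = koi8_bytes != cp1251_bytes != cp866_bytes

# Decoding with the wrong table silently yields unreadable text.
garbled = koi8_bytes.decode("cp1251")


def convert(data: bytes, source: str, target: str) -> bytes:
    """Hypothetical helper: re-encode bytes from one encoding to another."""
    return data.decode(source).encode(target)


# Converting to UTF-8 (a Unicode encoding) recovers the original text.
utf8_bytes = convert(koi8_bytes, "koi8_r", "utf-8")
```

Decoding with the correct source encoding is the essential step: the bytes alone do not say which table produced them, which is why mismatched encodings produce garbled output rather than an error.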
===Orthography===
Russian spelling is reasonably phonemic in practice. It is in fact a balance among phonemics, morphology, etymology, and grammar; and, like that of most living languages, has its share of inconsistencies and controversial points. A number of rigid [[spelling rule]]s introduced between the 1880s and 1910s have been responsible for the latter whilst trying to eliminate the former.
The current spelling follows the major reform of 1918, and the final codification of 1956. An update proposed in the late 1990s has met a hostile reception, and has not been formally adopted.
The punctuation, originally based on Byzantine Greek, was in the seventeenth and eighteenth centuries reformulated on the French and German models.
==Sounds==
The phonological system of Russian is inherited from [[Common Slavonic]], but underwent considerable modification in the early historical period, before being largely settled by about 1400.
The language possesses five vowels, which are written with different letters depending on whether or not the preceding consonant is [[palatalization|palatalized]]. The consonants typically come in plain vs. palatalized pairs, which are traditionally called ''hard'' and ''soft.'' (The ''hard'' consonants are often [[velarization|velarized]], especially before back vowels, although in some dialects the velarization is limited to hard {{IPA|/l/}}). The standard language, based on the Moscow dialect, possesses heavy stress and moderate variation in pitch. Stressed vowels are somewhat lengthened, while unstressed vowels tend to be reduced to near-close vowels or an unclear [[schwa]]. (See also: [[vowel reduction in Russian]].)
The Russian [[syllable]] structure can be quite complex, with both initial and final consonant clusters of up to four consecutive sounds. Using a formula with V standing for the nucleus (vowel) and C for each consonant, the structure can be described as follows:
(C)(C)(C)(C)V(C)(C)(C)(C)
Clusters of four consonants are not very common, however, especially within a morpheme.
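The syllable formula can be written as a regular expression over a consonant/vowel skeleton. The following Python sketch is only an orthographic approximation: the names <code>skeleton</code> and <code>fits_template</code> are hypothetical, every vowel letter is counted as a single V (so iotated vowels after another vowel are not split into two sounds), and the hard and soft signs are dropped because they are not sounds.

```python
import re

# Assumption: letters approximate sounds one-to-one, which is a simplification.
VOWELS = set("аеёиоуыэюя")
SIGNS = set("ьъ")  # soft and hard signs carry no sound of their own

# (C)(C)(C)(C)V(C)(C)(C)(C): up to four consonants on either side of one vowel.
TEMPLATE = re.compile(r"C{0,4}VC{0,4}")


def skeleton(syllable: str) -> str:
    """Map each letter to 'C' or 'V', skipping the hard and soft signs."""
    marks = []
    for letter in syllable.lower():
        if letter in SIGNS:
            continue
        marks.append("V" if letter in VOWELS else "C")
    return "".join(marks)


def fits_template(syllable: str) -> bool:
    """Check whether a single syllable matches the formula in the text."""
    return TEMPLATE.fullmatch(skeleton(syllable)) is not None
```

For instance, the one-syllable word взгляд has the skeleton CCCCVC and fits the formula, whereas a three-syllable word such as молоко (CVCVCV) does not, because the formula describes a single syllable.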
===Consonants===
Russian is notable for its distinction based on [[palatalization]] of most of the consonants. While {{IPA|/k/, /g/, /x/}} do have palatalized [[allophone]]s {{IPA|[kʲ, gʲ, xʲ]}}, only {{IPA|/kʲ/}} might be considered a phoneme, though it is marginal and generally not considered distinctive (the only native [[minimal pair]] which argues for {{IPA|/kʲ/}} to be a separate phoneme is "это ткёт"/"этот кот"). Palatalization means that the center of the tongue is raised during and after the articulation of the consonant. In the case of {{IPA|/tʲ/ and /dʲ/}}, the tongue is raised enough to produce slight frication (affricate sounds). The sounds {{IPA|/t, d, ʦ, s, z, n and rʲ/}} are [[dental consonant|dental]], that is, pronounced with the tip of the tongue against the teeth rather than against the [[alveolar ridge]].
==Grammar==
Russian has preserved an [[Indo-European languages|Indo-European]] [[Synthetic language|synthetic]]-[[inflection]]al structure, although considerable leveling has taken place.
Russian grammar encompasses
* a highly [[Synthetic language|synthetic]] '''morphology'''
* a '''syntax''' that, for the literary language, is the conscious fusion of three elements:
** a polished [[vernacular]] foundation;
** a [[Church Slavonic language|Church Slavonic]] inheritance;
** a [[Western Europe]]an style.
The spoken language has been influenced by the literary one, but continues to preserve characteristic forms. The dialects show various non-standard grammatical features, some of which are archaisms or descendants of old forms since discarded by the literary language.
==Vocabulary==
See [[History of the Russian language]] for an account of the successive foreign influences on the Russian language.
The total number of words in Russian is difficult to reckon because of the ability to agglutinate and create manifold compounds, diminutives, etc. (see [[Russian grammar#Word Formation|Word Formation]] under [[Russian grammar]]).
The number of listed words or entries in some of the major dictionaries published during the last two centuries, and the total vocabulary of [[Pushkin]] (who is credited with greatly augmenting and codifying literary Russian), are as follows:
(As a historical aside, [[Vladimir Ivanovich Dal|Dahl]] was, in the second half of the nineteenth century, still insisting that the adjective '''русский''', which was at that time applied uniformly to all the Orthodox Eastern Slavic subjects of the Empire, as well as to its one official language, be spelled '''руский''' with one s, in accordance with ancient tradition and what he termed the "spirit of the language". He was contradicted by the philologist Grot, who distinctly heard the s lengthened or doubled.)
=== Proverbs and sayings ===
The Russian language is replete with many hundreds of proverbs ('''пословица''' {{IPA|[pɐˈslo.vʲɪ.ʦə]}}) and sayings ('''поговоркa''' {{IPA|[pə.gɐˈvo.rkə]}}). These were already tabulated by the seventeenth century, and collected and studied in the nineteenth and twentieth, with the folk-tales being an especially fertile source.
==History and examples==
The history of the Russian language may be divided into the following periods.
* [[History of the Russian language#Kievan period and feudal breakup|Kievan period and feudal breakup]]
* [[History of the Russian language#The Tatar yoke and the Grand Duchy of Lithuania|The Tatar yoke and the Grand Duchy of Lithuania]]
* [[History of the Russian language#The Moscovite period (15th–17th centuries)|The Moscovite period (15th–17th centuries)]]
* [[History of the Russian language#Empire (18th–19th centuries)|Empire (18th–19th centuries)]]
* [[History of the Russian language#Soviet period and beyond (20th century)|Soviet period and beyond (20th century)]]
Judging by the historical records, by approximately 1000 AD the predominant ethnic group over much of modern European [[Russia]], [[Ukraine]], and [[Belarus]] was the Eastern branch of the [[Slavic peoples|Slavs]], speaking a closely related group of dialects. The political unification of this region into [[Kievan Rus']] in about 880, from which modern Russia, Ukraine and Belarus trace their origins, established [[Old East Slavic]] as a literary and commercial language. It was soon followed by the adoption of [[Christianity]] in 988 and the introduction of the South Slavic [[Old Church Slavonic]] as the liturgical and official language. Borrowings and [[calque]]s from Byzantine [[Greek language|Greek]] began to enter the [[Old East Slavic]] and spoken dialects at this time, which in their turn modified the [[Old Church Slavonic]] as well.
Dialectal differentiation accelerated after the breakup of [[Kievan Rus]] in approximately 1100. [[Ruthenian language|Ruthenian]] emerged on the territories of modern [[Belarus]] and [[Ukraine]], and [[History of the Russian language|medieval Russian]] in modern [[Russia]]. They became definitively distinct in the 13th century, by the time that land was divided between the [[Grand Duchy of Lithuania]] in the west and, in the east, the independent Novgorod Feudal Republic plus small duchies that were vassals of the Tatars.
The official language in Moscow and Novgorod, and later, in the growing Moscow Rus’, was [[Church Slavonic]], which evolved from [[Old Church Slavonic]] and remained [[Diglossia|the literary language]] until the Petrine age, when its usage shrank drastically to biblical and liturgical texts. Russian developed under the strong influence of Church Slavonic until the close of the seventeenth century; the influence reversed afterwards, leading to corruption of liturgical texts.
The political reforms of [[Peter I of Russia|Peter the Great]] were accompanied by a reform of the alphabet, and achieved their goal of secularization and Westernization. Blocks of specialized vocabulary were adopted from the languages of Western Europe. By 1800, a significant portion of the gentry spoke [[French language|French]], less often [[German language|German]], on an everyday basis. Many Russian novels of the 19th century, e.g. Lev Tolstoy’s "War and Peace", contain entire paragraphs and even pages in French with no translation given, on the assumption that educated readers would not need one.
The modern literary language is usually considered to date from the time of [[Aleksandr Pushkin]] in the first third of the nineteenth century. Pushkin revolutionized Russian literature by rejecting archaic grammar and vocabulary (the so-called "высокий стиль", or "high style") in favor of grammar and vocabulary found in the spoken language of the time. Even younger modern readers experience only slight difficulty understanding some words in Pushkin’s texts, since only a few words used by Pushkin have become archaic or changed meaning. On the other hand, many expressions used by Russian writers of the early 19th century, in particular Pushkin, [[Lermontov]], [[Gogol]] and Griboiädov, became proverbs or sayings which can frequently be found even in modern Russian colloquial speech.
The political upheavals of the early twentieth century and the wholesale changes of political ideology gave written Russian its modern appearance after the spelling reform of 1918. Political circumstances and Soviet accomplishments in military, scientific, and technological matters (especially cosmonautics), gave Russian a world-wide prestige, especially during the middle third of the twentieth century.
Web search engine
A '''Web search engine''' is a [[search engine (computing)|search engine]] designed to search for information on the [[World Wide Web]]. Information may consist of [[web page]]s, images and other types of files. Some search engines also mine data available in newsgroups, databases, or [[Web directory|open directories]]. Unlike [[Web directories]], which are maintained by human editors, search engines operate algorithmically or are a mixture of [[algorithmic]] and human input.
==History==
Before there were search engines there was a complete list of all webservers. The list was edited by [[Tim Berners-Lee]] and hosted on the CERN webserver. One historical snapshot from 1992 remains. As more and more webservers went online the central list could not keep up. On the NCSA Site new servers were announced under the title "What's New!", but no complete listing existed any more.
The very first tool used for searching on the (pre-web) Internet was [[Archie search engine|Archie]]. The name stands for "archive" without the "v". It was created in 1990 by [[Alan Emtage]], a student at [[McGill University]] in Montreal. The program downloaded the directory listings of all the files located on public anonymous FTP ([[File Transfer Protocol]]) sites, creating a searchable database of file names; however, Archie did not index the contents of these sites.
The rise of [[Gopher (protocol)|Gopher]] (created in 1991 by [[Mark McCahill]] at the [[University of Minnesota]]) led to two new search programs, [[Veronica (computer)|Veronica]] and [[Jughead (computer)|Jughead]]. Like Archie, they searched the file names and titles stored in Gopher index systems. Veronica ('''V'''ery '''E'''asy '''R'''odent-'''O'''riented '''N'''et-wide '''I'''ndex to '''C'''omputerized '''A'''rchives) provided a keyword search of most Gopher menu titles in the entire Gopher listings. Jughead ('''J'''onzy's '''U'''niversal '''G'''opher '''H'''ierarchy '''E'''xcavation '''A'''nd '''D'''isplay) was a tool for obtaining menu information from specific Gopher servers. While the name of the search engine "[[Archie search engine|Archie]]" was not a reference to the [[Archie Comics|Archie comic book]] series, "[[Veronica Lodge|Veronica]]" and "[[Jughead Jones|Jughead]]" are characters in the series, thus referencing their predecessor.
The first Web search engine was Wandex, a now-defunct index collected by the [[World Wide Web Wanderer]], a [[web crawler]] developed by Matthew Gray at [[Massachusetts Institute of Technology|MIT]] in 1993. Another very early search engine, [[Aliweb]], also appeared in 1993. [[JumpStation]] (released in early 1994) used a crawler to find web pages for searching, but search was limited to the title of web pages only. One of the first "full text" crawler-based search engines was [[WebCrawler]], which came out in 1994. Unlike its predecessors, it let users search for any word in any webpage, which became the standard for all major search engines since. It was also the first one to be widely known by the public. Also in 1994 [[Lycos]] (which started at [[Carnegie Mellon University]]) was launched, and became a major commercial endeavor.
Soon after, many search engines appeared and vied for popularity. These included [[Magellan]], [[Excite]], [[Infoseek]], [[Inktomi]], [[Northern Light Group|Northern Light]], and [[AltaVista]]. [[Yahoo!]] was among the most popular ways for people to find web pages of interest, but its search function operated on its [[web directory]], rather than full-text copies of web pages. Information seekers could also browse the directory instead of doing a keyword-based search.
In 1996, [[Netscape]] was looking to give a single search engine an exclusive deal to be its featured search engine. There was so much interest that Netscape instead struck a deal with five of the major search engines, under which, for $5 million per year, each search engine would be in a rotation on the Netscape search engine page. These five engines were [[Yahoo!]], [[Magellan]], [[Lycos]], [[Infoseek]] and [[Excite]].
Search engines were also known as some of the brightest stars in the Internet investing frenzy that occurred in the late 1990s. Several companies entered the market spectacularly, receiving record gains during their [[initial public offering]]s. Some have taken down their public search engine, and are marketing enterprise-only editions, such as Northern Light. Many search engine companies were caught up in the [[dot-com bubble]], a speculation-driven market boom that peaked in 1999 and ended in 2001.
Around 2000, the [[Google Search|Google search engine]] rose to prominence. The company achieved better results for many searches with an innovation called [[PageRank]]. This iterative algorithm ranks web pages based on the number and PageRank of other web sites and pages that link there, on the premise that good or desirable pages are linked to more than others. Google also maintained a minimalist interface to its search engine. In contrast, many of its competitors embedded a search engine in a [[web portal]].
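The iterative idea behind PageRank can be sketched in a few lines of Python. This is a simplified illustration rather than Google's implementation: the function name <code>pagerank</code>, the damping factor of 0.85, the fixed number of iterations and the even redistribution of rank from pages without outgoing links are all assumptions not stated in the text above.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by repeated redistribution of rank.

    links maps each page to the list of pages it links to.
    damping and iterations are illustrative assumptions.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    count = len(pages)

    # Start with rank spread evenly over all pages.
    rank = {page: 1.0 / count for page in pages}

    for _ in range(iterations):
        # Every page receives a small baseline share.
        new_rank = {page: (1.0 - damping) / count for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no links spreads its rank evenly.
                share = damping * rank[page] / count
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


example = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
```

In the example graph, page <code>c</code> is linked to by two pages and ends up ranked highest; <code>a</code> comes next because it is linked to by the highly ranked <code>c</code>, while <code>b</code>, which nothing links to, ranks lowest.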
By 2000, Yahoo! was providing search services based on [[Inktomi]]'s search engine. Yahoo! acquired [[Inktomi]] in 2002, and [[Overture]] (which owned [[AlltheWeb]] and [[AltaVista]]) in 2003. Yahoo! switched to Google's search engine and used it until 2004, when it launched its own search engine based on the combined technologies of its acquisitions.
Microsoft first launched MSN Search (since re-branded [[Live Search]]) in the fall of 1998 using search results from [[Inktomi]]. In early 1999 the site began to display listings from [[Looksmart]] blended with results from [[Inktomi]] except for a short time in 1999 when results from [[AltaVista]] were used instead. In 2004, Microsoft began a transition to its own search technology, powered by its own [[web crawler]] (called [[msnbot]]).
As of late 2007, Google was by far the most popular Web search engine worldwide.
A number of country-specific search engine companies have become prominent; for example [[Baidu]] is the most popular search engine in the [[People's Republic of China]] and [[guruji.com]] in [[India]].
==How Web search engines work==
A search engine operates in the following order:
# [[Web crawling]]
# [[Index (search engine)|Indexing]]
# [[Web search query|Searching]]
Web search engines work by storing information about many web pages, which they retrieve from the WWW itself. These pages are retrieved by a [[Web crawler]] (sometimes also known as a spider), an automated Web browser which follows every link it sees. Exclusions can be made by the use of [[robots.txt]]. The contents of each page are then analyzed to determine how it should be [[Search engine indexing|indexed]] (for example, words are extracted from the titles, headings, or special fields called [[meta tags]]). Data about web pages are stored in an index database for use in later queries. Some search engines, such as [[Google]], store all or part of the source page (referred to as a [[web cache|cache]]) as well as information about the web pages, whereas others, such as [[AltaVista]], store every word of every page they find. The cached page always holds the text that was actually indexed, so it can be very useful when the content of the current page has been updated and the search terms are no longer in it. This problem might be considered a mild form of [[linkrot]], and Google's handling of it increases [[usability]] by satisfying [[user expectations]], in keeping with the [[principle of least astonishment]], that the search terms will be on the returned webpage. Increased search relevance makes these cached pages very useful, even beyond the fact that they may contain data that is no longer available elsewhere.
When a user enters a [[web search query|query]] into a search engine (typically by using [[Keyword (Internet search)|key word]]s), the engine examines its [[inverted index|index]] and provides a listing of best-matching web pages according to its criteria, usually with a short summary containing the document's title and sometimes parts of the text. Most search engines support the use of the [[boolean operators]] AND, OR and NOT to further specify the [[web search query|search query]]. Some search engines provide an advanced feature called [[Proximity search (text)|proximity search]] which allows users to define the distance between keywords.
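The indexing and Boolean query steps described above can be illustrated with a toy inverted index in Python. This is a minimal sketch under simplifying assumptions, not how any particular search engine is built: the function names, the whitespace-and-punctuation tokenizer and the in-memory dictionary are all hypothetical, and no ranking is performed.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into words (a naive tokenizer)."""
    return re.findall(r"\w+", text.lower())


def build_index(documents):
    """Build an inverted index: each word maps to the set of page ids containing it."""
    index = defaultdict(set)
    for page_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(page_id)
    return index


def search_and(index, *words):
    """Pages containing every one of the words."""
    postings = [index.get(word.lower(), set()) for word in words]
    return set.intersection(*postings) if postings else set()


def search_or(index, *words):
    """Pages containing at least one of the words."""
    result = set()
    for word in words:
        result |= index.get(word.lower(), set())
    return result


def search_not(index, all_pages, word):
    """Pages that do not contain the word."""
    return set(all_pages) - index.get(word.lower(), set())


# Hypothetical crawled pages, keyed by an illustrative page id.
pages = {
    "p1": "Archie indexed FTP file names",
    "p2": "Veronica searched Gopher menu titles",
    "p3": "WebCrawler indexed the full text of web pages",
}
index = build_index(pages)
```

With this index, <code>search_and(index, "indexed", "text")</code> returns only <code>p3</code>, while <code>search_or(index, "archie", "veronica")</code> returns <code>p1</code> and <code>p2</code>. Real engines add ranking on top of such lookups.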
The usefulness of a search engine depends on the [[relevance (information retrieval)|relevance]] of the '''result set''' it gives back. While there may be millions of webpages that include a particular word or phrase, some pages may be more relevant, popular, or authoritative than others. Most search engines employ methods to [[rank order|rank]] the results to provide the "best" results first. How a search engine decides which pages are the best matches, and what order the results should be shown in, varies widely from one engine to another. The methods also change over time as Internet usage changes and new techniques evolve.
Most Web search engines are commercial ventures supported by [[advertising]] revenue and, as a result, some employ the controversial practice of allowing advertisers to pay money to have their listings ranked higher in search results. Those search engines which do not accept money for their search engine results make money by running search related ads alongside the regular search engine results. The search engines make money every time someone clicks on one of these ads.
The vast majority of search engines are run by private companies using proprietary algorithms and closed databases, though [[List of search engines#Open source search engines|some]] are open source.
Revenue in the web search portals industry is projected to grow in 2008 by 13.4 percent, with broadband connections expected to rise by 15.1 percent. Between 2008 and 2012, industry revenue is projected to rise by 56 percent as Internet penetration still has some way to go to reach full saturation in American households. Furthermore, broadband services are projected to account for an ever increasing share of domestic Internet users, rising to 118.7 million by 2012, with an increasing share accounted for by fiber-optic and high speed cable lines.
Semantics
'''Semantics''' is the study of meaning in communication. The word derives from [[Greek language|Greek]] ''σημαντικός'' (''semantikos''), "significant", from ''σημαίνω'' (''semaino''), "to signify, to indicate" and that from ''σήμα'' (''sema''), "sign, mark, token". In [[linguistics]] it is the study of interpretation of signs as used by [[agent]]s or [[community|communities]] within particular circumstances and contexts. It has related meanings in several other fields.
Semanticists differ on what constitutes [[Meaning (linguistics)|meaning]] in an expression. For example, in the sentence, "John loves a bagel", the word ''bagel'' may refer to the object itself, which is its ''literal'' meaning or ''[[denotation]]'', but it may also refer to many other figurative associations, such as how it meets John's hunger, etc., which may be its ''[[connotation]]''. Traditionally, the [[formal semantic]] view restricts semantics to its literal meaning, and relegates all figurative associations to [[pragmatics]], but this distinction is increasingly difficult to defend. The degree to which a theorist subscribes to the literal-figurative distinction decreases as one moves from the [[formal semantic]], [[semiotic]], [[pragmatic]], to the [[cognitive semantic]] traditions.
The word ''semantic'' in its modern sense is considered to have first appeared in [[French language|French]] as ''sémantique'' in [[Michel Bréal]]'s 1897 book, ''Essai de sémantique''.
In [[International Scientific Vocabulary]] semantics is also called ''[[semasiology]]''.
The discipline of Semantics is distinct from [[General semantics|Alfred Korzybski's General Semantics]], which is a system for looking at non-immediate, or abstract meanings.
==Linguistics==
In [[linguistics]], '''semantics''' is the subfield that is devoted to the study of meaning, as inherent at the levels of words, phrases, sentences, and even larger units of [[discourse]] (referred to as ''texts'').
The basic area of study is the meaning of [[sign (semiotics)|sign]]s, and the study of relations between different linguistic units: [[homonym]]y, [[synonym]]y, [[antonym]]y, [[polysemy]], [[paronyms]], [[hypernym]]y, [[hyponym]]y, [[meronymy]], [[metonymy]], [[holonymy]], [[exocentric]]ity / [[endocentric]]ity, linguistic [[compound (linguistics)|compounds]]. A key concern is how meaning attaches to larger chunks of text, possibly as a result of the composition from smaller units of meaning.
Traditionally, semantics has included the study of connotative ''[[word sense|sense]]'' and denotative ''[[reference]]'', [[truth condition]]s, [[argument structure]], [[thematic role]]s, [[discourse analysis]], and the linkage of all of these to syntax. [[Formal semantics|Formal semanticists]] are concerned with the modeling of meaning in terms of the semantics of logic. Thus the sentence ''John loves a bagel'' above can be broken down into its constituents (signs), of which the unit ''loves'' may serve as both syntactic and semantic [[head (linguistics)|head]].
In the late 1960s, [[Richard Montague]] proposed a system for defining semantic entries in the lexicon in terms of [[lambda calculus]]. Thus, the syntactic [[parsing|parse]] of the sentence above would now indicate ''loves'' as the head, and its entry in the lexicon would point to the arguments as the agent, ''John'', and the object, ''bagel'', with a special role for the article "a" (which Montague called a quantifier). This resulted in the sentence being associated with the logical predicate ''loves (John, bagel)'', thus linking semantics to [[categorial grammar]] models of [[syntax]].
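The compositional idea can be sketched with ordinary lambda expressions. The following is a heavily simplified, extensional illustration and not Montague's actual system: the tiny model (<code>entities</code>, <code>bagels</code>, <code>loves_pairs</code>) and all lexical entries are assumptions made for the example.

```python
# A tiny hypothetical model of the world.
entities = {"john", "mary", "bagel1", "bagel2"}
bagels = {"bagel1", "bagel2"}
loves_pairs = {("john", "bagel1")}  # (lover, loved) pairs that hold in the model

# Lexical entries written as lambda terms.
john = "john"
bagel = lambda x: x in bagels                                # one-place predicate
loves = lambda obj: lambda subj: (subj, obj) in loves_pairs  # curried two-place predicate
# The article "a" as a quantifier: it takes two predicates P and Q and
# is true when some entity satisfies both of them.
a = lambda P: lambda Q: any(P(x) and Q(x) for x in entities)

# "John loves a bagel": there is some x such that bagel(x) and loves(x)(john).
john_loves_a_bagel = a(bagel)(lambda x: loves(x)(john))
mary_loves_a_bagel = a(bagel)(lambda x: loves(x)("mary"))
```

The truth value of the whole sentence is computed purely from the meanings of its parts and the way they are combined, which is the sense in which the lexicon entries are "defined in terms of lambda calculus".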
The logical predicate thus obtained would be elaborated further, e.g. using truth theory models, which ultimately relate meanings to a set of [[Tarski]]ian universals, which may lie outside the logic. The notion of such meaning atoms or primitives is basic to the [[language of thought]] hypothesis of the 1970s.
Despite its elegance, [[Montague grammar]] was limited by the context-dependent variability in word sense, and led to several attempts at incorporating context, such as:
*[[situation semantics]] (1980s): truth-values are incomplete and get assigned based on context
*[[generative lexicon]] (1990s): categories (types) are incomplete and get assigned based on context
===The dynamic turn in semantics===
In the [[Noam Chomsky|Chomskian]] tradition in linguistics there was no mechanism for the learning of semantic relations, and the [[Psychological nativism|nativist]] view considered all semantic notions as inborn. Thus, even novel concepts were proposed to have been dormant in some sense. This traditional view was also unable to address many issues such as [[metaphor]] or associative meanings, and [[semantic change]], where meanings within a linguistic community change over time, and [[qualia]] or subjective experience. Another issue not addressed by the nativist model was how perceptual cues are combined in thought, e.g. in [[mental rotation]].
This traditional view of semantics, as an innate finite meaning inherent in a [[lexical unit]] that can be composed to generate meanings for larger chunks of discourse, is now being fiercely debated in the emerging domain of [[cognitive linguistics]] and also in the non-[[Jerry Fodor|Fodorian]] camp in [[Philosophy of Language]].
The challenge is motivated by:
* factors internal to language, such as the problem of resolving [[indexical]] or [[anaphora]] (e.g. ''this x'', ''him'', ''last week''). In these situations "context" serves as the input, but the interpreted utterance also modifies the context, so it is also the output. Thus, the interpretation is necessarily dynamic and the meaning of sentences is viewed as context-change potentials instead of [[propositions]].
* factors external to language, i.e. language is not a set of labels stuck on things, but "a toolbox, the importance of whose elements lie in the way they function rather than their attachments to things." This view reflects the position of the later [[Wittgenstein]] and his famous ''game'' example, and is related to the positions of [[Willard Van Orman Quine|Quine]], [[Donald Davidson (philosopher)|Davidson]], and others.
A concrete example of the latter phenomenon is semantic [[underspecification]] — meanings are not complete without some elements of context. To take an example of a single word, "red", its meaning in a phrase such as ''red book'' is similar to many other usages, and can be viewed as compositional. However, the colours implied in phrases such as "red wine" (very dark), and "red hair" (coppery), or "red soil", or "red skin" are very different. Indeed, these colours by themselves would not be called "red" by native speakers. These instances are contrastive, so "red wine" is so called only in comparison with the other kind of wine (which also is not "white" for the same reasons). This view goes back to [[Ferdinand de Saussure|de Saussure]]:
:Each of a set of synonyms like ''redouter'' ('to dread'), ''craindre'' ('to fear'), ''avoir peur'' ('to be afraid') has its particular value only because they stand in contrast with one another. No word has a value that can be identified independently of what else is in its vicinity.
and may go back to earlier [[India]]n views on language, especially the [[Nyaya]] view of words as [[Semantic indicator|indicators]] and not carriers of meaning.
An attempt to defend a system based on propositional meaning for semantic underspecification can be found in the [[Generative Lexicon]] model of [[James Pustejovsky]], who extends contextual operations (based on type shifting) into the lexicon. Thus meanings are generated on the fly based on finite context.
===Prototype theory===
Another set of concepts related to fuzziness in semantics is based on [[Prototype Theory|prototype]]s. The work of [[Eleanor Rosch]] and [[George Lakoff]] in the 1970s led to a view that natural categories are not characterizable in terms of necessary and sufficient conditions, but are graded (fuzzy at their boundaries) and inconsistent as to the status of their constituent members. Systems of categories are not objectively "out there" in the world but are rooted in people's experience. These categories evolve as [[learning theory (education)|learned]] concepts of the world; meaning is not an objective truth, but a subjective construct, learned from experience, and language arises out of the "grounding of our conceptual systems in shared [[embodied philosophy|embodiment]] and bodily experience". A corollary of this is that the conceptual categories (i.e. the lexicon) will not be identical for different cultures, or indeed, for every individual in the same culture. This leads to another debate (see the [[Whorf-Sapir hypothesis]] or [[Eskimo words for snow]]).
==Computer science==
In [[computer science]], where it is considered as an application of [[mathematical logic]], semantics reflects the meaning of programs or functions.
In this regard, semantics permits programs to be separated into their syntactical part (grammatical structure) and their semantic part (meaning). For instance, the following statements use different syntaxes (languages), but have the same semantics:
* x += y; ([[C (programming language)|C]], [[Java (programming language)|Java]], etc.)
* x := x + y; ([[Pascal (programming language)|Pascal]])
* Let x = x + y; (early [[BASIC]])
* x = x + y (most BASIC dialects, [[Fortran]])
Generally these operations would all perform an arithmetical addition of 'y' to 'x' and store the result in a variable 'x'.
Semantics for computer applications falls into three categories:
* [[Operational semantics]]: The meaning of a construct is specified by the computation it induces when it is executed on a machine. In particular, it is of interest ''how'' the effect of a computation is produced.
* [[Denotational semantics]]: Meanings are modelled by mathematical objects that represent the effect of executing the constructs. Thus ''only'' the effect is of interest, not how it is obtained.
* [[Axiomatic semantics]]: Specific properties of the effect of executing the constructs are expressed as ''assertions''. Thus there may be aspects of the executions that are ignored.
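The three approaches can be contrasted on the single statement ''x := x + y''. The sketch below is an informal illustration only; the tuple encoding of the statement and the function names (<code>run_steps</code>, <code>denote</code>, <code>holds_triple</code>) are assumptions made for this example and do not belong to any standard formalism.

```python
# The statement  x := x + y  encoded as (target, left operand, right operand).
stmt = ("x", "x", "y")

def run_steps(stmt, state):
    """Operational view: record HOW a machine would carry out the statement."""
    target, left, right = stmt
    trace = []
    a = state[left]
    trace.append(f"load {left} -> {a}")
    b = state[right]
    trace.append(f"load {right} -> {b}")
    total = a + b
    trace.append(f"add -> {total}")
    new_state = {**state, target: total}
    trace.append(f"store {target} <- {total}")
    return new_state, trace

def denote(stmt):
    """Denotational view: the meaning is just a function from states to states."""
    target, left, right = stmt
    return lambda state: {**state, target: state[left] + state[right]}

def holds_triple(stmt, state):
    """Axiomatic view: check the assertion {x = a, y = b} stmt {x = a + b, y = b}."""
    a, b = state["x"], state["y"]
    after = denote(stmt)(state)
    return after["x"] == a + b and after["y"] == b

state = {"x": 2, "y": 3}
final_state, trace = run_steps(stmt, state)
```

Both <code>run_steps</code> and <code>denote</code> agree on the final state, but only the first exposes the intermediate steps, while <code>holds_triple</code> checks a property of the result and ignores everything else.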
The '''[[Semantic Web]]''' refers to the extension of the [[World Wide Web]] through the embedding of additional semantic [[metadata]]; see also [[Web Ontology Language]] (OWL).
==Psychology==
In [[psychology]], ''[[semantic memory]]'' is memory for meaning, in other words, the aspect of memory that preserves only the ''gist'', the general significance, of remembered experience, while [[episodic memory]] is memory for the ephemeral details, the individual features, or the unique particulars of experience. The meaning of a word is measured by the company it keeps, that is, by the relationships among words themselves in a [[semantic network]]. In a network created by people analyzing their understanding of the word (such as [[Wordnet]]) the links and decomposition structures of the network are few in number and kind, and include "part of", "kind of", and similar links. In automated [[ontologies]] the links are computed vectors without explicit meaning. Various automated technologies are being developed to compute the meaning of words: [[latent semantic indexing]] and [[support vector machines]] as well as [[natural language processing]], [[neural networks]] and [[predicate calculus]] techniques.
Semantics has been reported to drive the course of psychotherapeutic interventions. Language structure can determine the treatment approach to drug-abusing patients. While working in Europe for the US Information Agency, American psychiatrist Dr. A. James Giannini reported semantic differences in medical approaches to addiction treatment. English-speaking countries used the term "drug dependence" to describe a rather passive pathology in their patients. As a result the physician's role was more active. Southern European countries such as Italy and Yugoslavia utilized the concept of "tossicomania" (i.e. toxic mania) to describe a more active rather than passive role of the addict. As a result the treating physician's role shifted to that of a more passive guide than that of an active interventionist.
Sentence (linguistics)
In [[linguistics]], a '''sentence''' is a grammatical unit of one or more words, bearing minimal syntactic relation to the words that precede or follow it, often preceded and followed in speech by pauses, having one of a small number of characteristic intonation patterns, and typically expressing an independent statement, question, request, command, etc. Sentences are generally characterized in most languages by the presence of a [[finite verb]], e.g. "[[The quick brown fox jumps over the lazy dog]]".
==Components of a sentence==
A simple ''complete sentence'' consists of a ''[[subject (grammar)|subject]]'' and a ''[[predicate (grammar)|predicate]]''. The subject is typically a [[noun phrase]], though other kinds of phrases (such as [[gerund]] phrases) work as well, and some languages allow subjects to be omitted. The predicate is a finite [[verb phrase]]: it is a finite verb together with zero or more [[object (grammar)|objects]], zero or more [[complement (linguistics)|complements]], and zero or more [[adverbial]]s. See also [[copula]] for the consequences of this verb on the theory of sentence structure.
===Clauses===
A [[clause]] consists of a subject and a verb. There are two types of clauses: independent and subordinate (dependent). An independent clause consists of a subject and a verb and also expresses a complete thought: for example, "I am sad." A subordinate clause consists of a subject and a verb, but expresses an incomplete thought: for example, "Because I had to move."
==Classification==
===By structure===
One traditional scheme for classifying [[English language|English]] sentences is by the number and types of [[finite verb|finite]] [[clause]]s:
* A ''[[simple sentence]]'' consists of a single [[independent clause]] with no [[dependent clause]]s.
* A ''[[compound sentence (linguistics)|compound sentence]]'' consists of multiple independent clauses with no dependent clauses. These clauses are joined together using [[grammatical conjunction|conjunctions]], [[punctuation]], or both.
* A ''[[complex sentence]]'' consists of a single independent clause with at least one dependent clause.
* A ''[[complex-compound sentence]]'' (or ''compound-complex sentence'') consists of multiple independent clauses, at least one of which has at least one dependent clause.
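The scheme above amounts to a decision on two counts. The sketch below assumes the traditional reading in which a complex sentence has exactly one independent clause; the function name <code>classify_by_structure</code> is hypothetical, and counting the clauses of a real sentence is left to the reader.

```python
def classify_by_structure(independent, dependent):
    """Classify a sentence from its numbers of independent and dependent clauses."""
    if independent < 1 or dependent < 0:
        raise ValueError("a sentence needs at least one independent clause")
    if independent == 1:
        return "simple" if dependent == 0 else "complex"
    return "compound" if dependent == 0 else "compound-complex"

# "I am sad."                                        -> 1 independent, 0 dependent
# "I am sad because I had to move."                  -> 1 independent, 1 dependent
# "I am sad, and I am tired."                        -> 2 independent, 0 dependent
# "I am sad because I had to move, and I am tired."  -> 2 independent, 1 dependent
labels = [
    classify_by_structure(1, 0),
    classify_by_structure(1, 1),
    classify_by_structure(2, 0),
    classify_by_structure(2, 1),
]
```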
===By purpose===
Sentences can also be classified based on their purpose:
*A ''declarative sentence'' or ''declaration'', the most common type, commonly makes a statement: ''I am going home.''
*A ''negative sentence'' or ''[[negation (linguistics)|negation]]'' denies that a statement is true: ''I am not going home.''
*An ''interrogative sentence'' or ''[[question]]'' is commonly used to request information — ''When are you going to work?'' — but sometimes not; ''see'' [[rhetorical question]].
*An ''exclamatory sentence'' or ''[[exclamation]]'' is generally a more emphatic form of statement: ''What a wonderful day this is!''
===Major and minor sentences===
A major sentence is a ''regular'' sentence; it has a [[subject (grammar)|subject]] and a [[predicate (grammar)|predicate]].
For example: ''I have a ball.'' In this sentence one can change the persons: ''We have a ball.'' However, a minor sentence is an irregular type of sentence. It does not contain a finite verb. For example, "Mary!" "Yes." "Coffee." etc. Other examples of minor sentences are headings (e.g. the heading of this entry), stereotyped expressions (''Hello!''), emotional expressions (''Wow!''), proverbs, etc. This can also include sentences which do not contain verbs (e.g. ''The more, the merrier.'') in order to intensify the meaning around the nouns (normally found in poetry and catchphrases).
Computer software
'''Computer software''', or just '''software''', is a general term used to describe a collection of [[computer program]]s, [[procedures]] and documentation that perform some tasks on a computer system.
The term includes [[application software]] such as [[word processor]]s which perform productive tasks for users, [[system software]] such as [[operating system]]s, which interface with [[hardware]] to provide the necessary services for application software, and [[middleware]] which controls and co-ordinates [[Distributed computing|distributed systems]].
"Software" is sometimes used in a broader context to mean anything which is not hardware but which is ''used'' with hardware, such as film, tapes and records.
==Relationship to computer hardware==
[[Computer]] software is so called to distinguish it from [[computer hardware]], which encompasses the physical interconnections and devices required to store and execute (or run) the software. At the lowest level, software consists of a [[machine language]] specific to an individual processor. A machine language consists of groups of binary values signifying processor instructions which change the state of the computer from its preceding state. Software is an ordered sequence of instructions for changing the state of the computer hardware in a particular sequence. It is usually written in [[high-level programming language]]s that are easier and more efficient for humans to use (closer to [[natural language]]) than machine language. High-level languages are [[compiler|compiled]] or [[interpreter (computing)|interpreted]] into machine language object code. Software may also be written in an [[assembly language]], essentially, a mnemonic representation of a machine language using a natural language alphabet. Assembly language must be assembled into object code via an [[assembly language#Assembler|assembler]].
The term "software" was first used in this sense by [[John W. Tukey]] in [[1958]]. In [[computer science]] and [[software engineering]], '''computer software''' is all computer programs. The theory that is the basis for most modern software was first proposed by [[Alan Turing]] in his [[1936]] essay ''On Computable Numbers, with an Application to the Entscheidungsproblem''.
==Types==
Practical [[computer system]]s divide [[software system]]s into three major classes: [[system software]], [[programming software]] and [[application software]], although the distinction is arbitrary, and often blurred.
*'''[[System software]]''' helps run the [[computer hardware]] and [[computer system]]. It includes [[operating system]]s, [[device driver]]s, diagnostic tools, [[Server (computing)|server]]s, [[windowing system]]s, [[software utility|utilities]] and more. The purpose of systems software is to insulate the applications programmer as much as possible from the details of the particular computer complex being used, especially memory and other hardware features, and such accessory devices as communications, printers, readers, displays, keyboards, etc.
*'''[[Programming software]]''' usually provides tools to assist a [[programmer]] in writing [[computer program]]s and software in different [[programming language]]s more conveniently. The tools include [[text editors]], [[compilers]], [[interpreter (computing)|interpreters]], [[linkers]], [[debuggers]], and so on. An [[Integrated development environment]] (IDE) merges those tools into a software bundle, and a programmer may not need to type multiple [[command]]s for compiling, interpreting, debugging, tracing, etc., because the IDE usually has an advanced ''[[graphical user interface]]'' (GUI).
*'''[[Application software]]''' allows end users to accomplish one or more specific (non-computer related) [[task]]s. Typical applications include [[Industry|industrial]] [[automation]], [[business software]], [[educational software]], [[medical software]], [[database]]s, and [[computer games]]. Businesses are probably the biggest users of application software, but almost every field of human activity now uses some form of application software.
==Program and library==
A [[Computer program|program]] may not be sufficiently complete for execution by a [[computer]]. In particular, it may require additional software from a [[software library]] in order to be complete. Such a library may include software components used by [[stand-alone]] programs, but which cannot work on their own. Thus, programs may include standard routines that are common to many programs, extracted from these libraries. Libraries may also ''include'' 'stand-alone' programs which are activated by some [[event-driven programming|computer event]] and/or perform some function (e.g., of computer 'housekeeping') but do not return data to their calling program. Libraries may be [[Execution (computers)|called]] by one to many other programs; programs may call zero to many other programs.
==Three layers==
Users often see things differently than programmers. People who use modern general purpose computers (as opposed to [[embedded system]]s, [[analog computer]]s, [[supercomputer]]s, etc.) usually see three layers of software performing a variety of tasks: platform, application, and user software.
;Platform software: [[Platform (computing)|Platform]] includes the [[firmware]], [[device driver]]s, an [[operating system]], and typically a [[graphical user interface]] which, in total, allow a user to interact with the computer and its [[peripheral]]s (associated equipment). Platform software often comes bundled with the computer. On a [[Personal computer|PC]] you will usually have the ability to change the platform software.
;Application software: [[Application software]] or Applications are what most people think of when they think of software. Typical examples include office suites and video games. Application software is often purchased separately from computer hardware. Sometimes applications are bundled with the computer, but that does not change the fact that they run as independent applications. Applications are almost always independent programs from the operating system, though they are often tailored for specific platforms. Most users think of compilers, databases, and other "system software" as applications.
;User-written software: [[End-user development]] tailors systems to meet users' specific needs. User software includes spreadsheet templates, word processor macros, scientific simulations, and scripts for graphics and animations. Even email filters are a kind of user software. Users create this software themselves and often overlook how important it is. Depending on how competently the user-written software has been integrated into purchased application packages, many users may not be aware of the distinction between the purchased packages and what has been added by co-workers.
==Creation==
==Operation==
Computer software has to be "loaded" into the [[computer storage|computer's storage]] (such as a ''[[hard drive]]'', ''memory'', or ''[[RAM]]''). Once the software has loaded, the computer is able to ''execute'' the software. This involves passing [[instruction (computer science)|instructions]] from the application software, through the system software, to the [[hardware]] which ultimately receives the instruction as [[machine language|machine code]]. Each instruction causes the computer to carry out an operation -- moving [[data (computing)|data]], carrying out a [[computation]], or altering the [[control flow]] of instructions.
Data movement is typically from one place in memory to another. Sometimes it involves moving data between memory and registers which enable high-speed data access in the CPU. Moving data, especially large amounts of it, can be costly. So, this is sometimes avoided by using "pointers" to data instead. Computations include simple operations such as incrementing the value of a variable data element. More complex computations may involve many operations and data elements together.
Instructions may be performed sequentially, conditionally, or iteratively. Sequential instructions are those operations that are performed one after another. Conditional instructions are performed such that different sets of instructions execute depending on the value(s) of some data. In some languages this is known as an "if" statement. Iterative instructions are performed repetitively and may depend on some data value. This is sometimes called a "loop." Often, one instruction may "call" another set of instructions that are defined in some other program or [[module (programming)|module]]. When more than one computer processor is used, instructions may be executed simultaneously.
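These control-flow patterns can be seen together in a short sketch. The functions <code>average</code> and <code>summarize</code> and the sample readings are hypothetical, written in Python purely for illustration.

```python
def average(values):
    """A separate set of instructions that other code can "call"."""
    return sum(values) / len(values)

def summarize(readings):
    # Sequential: these two assignments run one after another.
    total = 0
    valid = []
    # Iterative: the loop body repeats once per reading.
    for reading in readings:
        # Conditional: which branch runs depends on the value of the data.
        if reading >= 0:
            total += reading
            valid.append(reading)
    # Call: control passes to average() and then returns here.
    mean = average(valid) if valid else None
    label = "ok" if valid else "empty"
    return label, total, mean

result = summarize([4, -1, 6])  # the negative reading is skipped
```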
A simple example of the way software operates is what happens when a user selects an entry such as "Copy" from a menu. In this case, a conditional instruction is executed to copy text from data in a 'document' area residing in memory, perhaps to an intermediate storage area known as a 'clipboard' data area. If a different menu entry such as "Paste" is chosen, the software may execute the instructions to copy the text from the clipboard data area to a specific location in the same or another document in memory.
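The menu example can be sketched as follows. The class <code>Editor</code>, its <code>handle</code> method and the character offsets are assumptions made for illustration; real applications use the clipboard services of the platform rather than a private string.

```python
class Editor:
    """A toy document editor with its own in-memory clipboard."""

    def __init__(self, text):
        self.document = text  # the 'document' area in memory
        self.clipboard = ""   # the intermediate 'clipboard' data area

    def handle(self, command, start=0, end=0):
        # A conditional instruction selects what to do for each menu entry.
        if command == "Copy":
            # Copy the selected text from the document to the clipboard.
            self.clipboard = self.document[start:end]
        elif command == "Paste":
            # Insert the clipboard text into the document at position start.
            self.document = (
                self.document[:start] + self.clipboard + self.document[start:]
            )
        else:
            raise ValueError(f"unknown menu entry: {command}")

editor = Editor("hello world")
editor.handle("Copy", 0, 5)   # select "hello"
editor.handle("Paste", 11)    # paste at the end of the document
```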
Depending on the application, even the example above could become complicated. The field of [[software engineering]] endeavors to manage the complexity of how software operates. This is especially true for software that operates in the context of a large or powerful [[computer system]].
Currently, almost the only limitation on the use of computer software in applications is the ingenuity of the designer/programmer. Consequently, large areas of activities (such as playing grand master level chess) formerly assumed to be incapable of software simulation are now routinely programmed. The only area that has so far proved reasonably secure from software simulation is the realm of human art, especially pleasing music and literature.
Kinds of software by operation: [[computer program]] as [[executable]], [[source code]] or [[script (computer programming)|script]], [[computer configuration|configuration]].
==Quality and reliability==
[[Software reliability]] considers the errors, faults, and failures related to the design, implementation and operation of software.
'''See''' [[Computer security audit|Software auditing]], [[Software quality]], [[Software testing]], and [[Software reliability]].
==License==
A [[software license]] gives the user the right to use the software in the licensed environment. Some software comes with the license when purchased off the shelf, or with an OEM license when bundled with hardware. Other software comes with a [[free software licence]], granting the recipient the rights to modify and redistribute the software. Software can also be in the form of [[freeware]] or [[shareware]]. See also [[License Management]].
==Patents==
The issue of [[software patent]]s is controversial. Some believe that they hinder [[software development]], while others argue that software patents provide an important incentive to spur software innovation. See [[software patent debate]].
==Ethics and rights for software users==
Because software is a new part of society, the idea of what rights users of software should have is not very developed. Some, such as the [[free software community]], believe that software users should be free to modify and redistribute the software they use. They argue that these rights are necessary so that each individual can control their computer, and so that everyone can cooperate, if they choose, to work together as a community and control the direction in which software progresses. Others believe that software authors should have the power to say what rights the user will get.
==Software companies and non-profit organizations==
Examples of non-profit software organizations: [[Free Software Foundation]], [[GNU Project]], [[Mozilla Foundation]].
Examples of large software companies: [[Microsoft]], [[IBM]], [[Oracle_Corporation|Oracle]], [[SAP AG|SAP]] and [[HP]].
Spanish language
'''Spanish''' or '''Castilian''' (''castellano'') is an [[Indo-European]], [[Romance languages|Romance language]] that originated in northern [[Spain]], and gradually spread in the [[Kingdom of Castile]] and evolved into the principal language of government and trade. It was taken to [[Spanish Empire#Territories in Africa (1898–1975)|Africa]], the [[Spanish colonization of the Americas|Americas]], and [[Spanish East Indies|Asia Pacific]] with the expansion of the [[Spanish Empire]] between the fifteenth and nineteenth centuries.
Today, between 322 and 400 million people speak Spanish as a native language, making it the world's second most-spoken language by native speakers (after [[Standard Mandarin|Mandarin Chinese]]).
==Hispanosphere==
It is estimated that the combined total of native and non-native Spanish speakers is approximately 500 million, likely making it the third most spoken language by total number of speakers (after [[English_language|English]] and [[Chinese_language|Chinese]]).
Today, Spanish is an official language of Spain, most [[Latin American]] countries, and [[Equatorial Guinea]]; 21 nations speak it as their primary language. Spanish is also one of the [[United Nations#Languages|six official languages]] of the [[United Nations]]. [[Mexico]] has the world's largest Spanish-speaking population, and Spanish is the second most widely spoken language in the [[United States]] and the most widely studied foreign language in [[United States|U.S.]] schools and universities. [[Global internet usage]] statistics for 2007 show Spanish as the third most commonly used language on the Internet, after English and [[Chinese language|Chinese]].
==Naming and origin==
Spaniards tend to call this language {{lang|es|'''''español'''''}} (Spanish) when contrasting it with languages of other states, such as [[French language|French]] and [[English language|English]], but call it {{lang|es|'''''castellano'''''}} (Castilian), that is, the language of the [[Castile (historical region)|Castile]] region, when contrasting it with other [[languages of Spain|languages spoken in Spain]] such as [[Galician language|Galician]], [[Basque language|Basque]], and [[Catalan language|Catalan]]. This reasoning also holds true for the language's preferred name in some [[Hispanic America]]n countries. In this manner, the [[Spanish Constitution of 1978]] uses the term {{lang|es|''castellano''}} to define the [[official language]] of the whole Spanish State, as opposed to {{lang|es|''las demás lenguas españolas''}} (lit. ''the other Spanish languages''). Article III reads as follows:
The name ''castellano'' is, however, widely used for the language as a whole in Latin America. Some Spanish speakers consider ''{{lang|es|castellano}}'' a generic term with no political or ideological links, much as "Spanish" is in English. Latin Americans often use it to distinguish their own variety of Spanish from the variety spoken in Spain, or from the variety that is considered standard in the region.
==Classification and related languages==
Spanish is closely related to the other [[West Iberian languages|West Iberian]] Romance languages: [[Asturian language|Asturian]] ({{lang|ast|''asturianu''}}), [[Galician language|Galician]] ({{lang|gl|''galego''}}), [[Ladino language|Ladino]] ({{lang|lad|''dzhudezmo/spanyol/kasteyano''}}), and [[Portuguese language|Portuguese]] ({{lang|pt|''português''}}). Catalan, an [[Iberian Romance languages|East Iberian language]] which exhibits many [[Gallo-Romance]] traits, is more similar to the neighbouring [[Occitan language]] ({{lang|oc|''occitan''}}) than to Spanish, or indeed than Spanish and Portuguese are to each other.
Spanish and Portuguese share similar grammars and vocabulary as well as a common history of [[Influence of Arabic on other languages|Arabic influence]] while a great part of the peninsula was under [[Timeline of the Muslim presence in the Iberian peninsula|Islamic rule]] (both languages expanded over [[Islamic empire|Islamic territories]]). Their [[lexical similarity]] has been estimated as 89%. See [[Differences between Spanish and Portuguese]] for further information.
===Ladino===
Ladino, which is essentially medieval Spanish and closer to modern Spanish than any other language, is spoken by many descendants of the [[Sephardi Jews]] who were [[Alhambra decree|expelled from Spain in the 15th century]]. Ladino speakers are currently almost exclusively [[Sephardim|Sephardi]] Jews, with family roots in Turkey, Greece or the Balkans: current speakers mostly live in Israel and Turkey, with a few pockets in Latin America. It lacks the [[Amerindian languages|Native American vocabulary]] which was influential during the [[Spanish Empire|Spanish colonial period]], and it retains many archaic features which have since been lost in standard Spanish. It contains, however, other vocabulary which is not found in standard Castilian, including vocabulary from [[Hebrew language|Hebrew]], some French, Greek and [[Turkish language|Turkish]], and other languages spoken where the Sephardim settled.
Ladino is in serious danger of extinction because many native speakers today are elderly, among them elderly ''olim'' (immigrants to [[Israel]]), and have not transmitted the language to their children or grandchildren. However, it is experiencing a minor revival among Sephardi communities, especially in music. In the case of the Latin American communities, the danger of extinction is also due to the risk of assimilation by modern Castilian.
A related dialect is [[Haketia]], the Judaeo-Spanish of northern Morocco. This too tended to assimilate with modern Spanish, during the Spanish occupation of the region.
===Vocabulary comparison===
Spanish and [[Italian language|Italian]] share a very similar phonological system. At present, the [[lexical similarity]] with Italian is estimated at 82%. As a result, Spanish and Italian are mutually intelligible to various degrees. The lexical similarity with [[Portuguese language|Portuguese]] is greater, 89%, but the vagaries of Portuguese pronunciation make it less easily understood by Hispanophones than Italian. [[Mutual intelligibility]] between Spanish and [[French language|French]] or [[Romanian language|Romanian]] is even lower (lexical similarity being respectively 75% and 71%): comprehension of Spanish by French speakers who have not studied the language is as low as an estimated 45% - the same as of English. The common features of the writing systems of the Romance languages allow for a greater amount of interlingual reading comprehension than oral communication would.
1. also {{lang|pt|''nós outros''}} in early modern Portuguese (e.g. ''[[The Lusiads]]'')
2. {{lang|it|''noi '''altri'''''}} in Southern [[List of languages of Italy|Italian dialects and languages]]
3. Alternatively {{lang|fr|''nous '''autres'''''}}
==History==
Spanish evolved from [[Vulgar Latin]], with major [[Arabic influence on the Spanish language|influences from Arabic]] in vocabulary during the [[Al-Andalus|Andalusian]] period and minor surviving influences from [[Basque language|Basque]] and [[Celtiberian language|Celtiberian]], as well as [[Germanic languages]] via the [[Visigoths]]. Spanish developed along the remote crossroad strips among the [[Alava]], [[Cantabria]], [[Burgos]], [[Soria]] and [[La Rioja (autonomous community)|La Rioja]] provinces of Northern Spain, as a strongly innovative and differing variant from its nearest cousin, [[Asturian|Leonese speech]], with a higher degree of Basque influence in these regions (see [[Iberian Romance languages]]). Typical features of Spanish diachronic [[phonology]] include [[lenition]] (Latin {{lang|la|''vita''}}, Spanish {{lang|es|''vida''}}), [[palatalization]] (Latin {{lang|la|''annum''}}, Spanish {{lang|es|''año''}}, and Latin {{lang|la|''anellum''}}, Spanish {{lang|es|''anillo''}}) and [[diphthong]]ization ([[stem (linguistics)|stem]]-changing) of short ''e'' and ''o'' from Vulgar Latin (Latin {{lang|la|''terra''}}, Spanish {{lang|es|''tierra''}}; Latin {{lang|la|''novus''}}, Spanish {{lang|es|''nuevo''}}). Similar phenomena can be found in other Romance languages as well.
During the {{lang|es|''[[Reconquista]]''}}, this northern dialect from [[Cantabria]] was carried south, and it remains a [[minority language]] in northern coastal [[Morocco]].
The first Latin-to-Spanish grammar ({{lang|es|''Gramática de la Lengua Castellana''}}) was written in [[Salamanca]], Spain, in 1492, by [[Antonio de Nebrija|Elio Antonio de Nebrija]]. When it was presented to [[Isabel de Castilla]], she asked, "What do I want a work like this for, if I already know the language?", to which he replied, "Your highness, the language is the instrument of the Empire."
From the 16th century onwards, the language was taken to the [[Americas]] and the [[Spanish East Indies]] via [[Spanish colonization of the Americas|Spanish colonization]].
In the 20th century, Spanish was introduced to [[Equatorial Guinea]] and the [[Western Sahara]], and to parts of the United States that had not been part of the Spanish Empire, such as [[Spanish Harlem]] in [[New York City]]. For details on borrowed words and other external influences upon Spanish, see [[Influences on the Spanish language]].
===Characterization===
A defining characteristic of Spanish was the [[diphthong]]ization of the Latin short vowels ''e'' and ''o'' into ''ie'' and ''ue'', respectively, when they were stressed. Similar [[sound law|sound changes]] are found in other Romance languages, but in Spanish they were significant. Some examples:
* Lat. {{lang|la|''petra''}} > Sp. {{lang|es|''piedra''}}, It. {{lang|it|''pietra''}}, Fr. {{lang|fr|''pierre''}}, Rom. {{lang|ro|''piatrǎ''}}, Port./Gal. {{lang|pt|''pedra''}} "stone".
* Lat. {{lang|la|''moritur''}} > Sp. {{lang|es|''muere''}}, It. {{lang|it|''muore''}}, Fr. {{lang|fr|''meurt''}} / {{lang|fr|''muert''}}, Rom. {{lang|ro|''moare''}}, Port./Gal. {{lang|pt|''morre''}} "die".
Peculiar to early Spanish (as in the [[Gascon]] dialect of Occitan, and possibly due to a Basque [[substratum]]) was the mutation of Latin initial ''f-'' into ''h-'' whenever it was followed by a vowel that did not diphthongize. Compare for instance:
* Lat. {{lang|la|''filium''}} > It. {{lang|it|''figlio''}}, Port. {{lang|pt|''filho''}}, Gal. {{lang|gl|''fillo''}}, Fr. {{lang|fr|''fils''}}, Occitan {{lang|oc|''filh''}} (but Gascon {{lang|gsc|''hilh''}}) Sp. {{lang|es|''hijo''}} (but Ladino {{lang|lad|''fijo''}});
* Lat. {{lang|la|''fabulari''}} > Lad. {{lang|lad|''favlar''}}, Port./Gal. {{lang|pt|''falar''}}, Sp. {{lang|es|''hablar''}};
* but Lat. {{lang|la|''focum''}} > It. {{lang|it|''fuoco''}}, Port./Gal. {{lang|pt|''fogo''}}, Sp./Lad. {{lang|es|''fuego''}}.
Some [[consonant cluster]]s of Latin also produced characteristically different results in these languages, for example:
* Lat. {{lang|la|''clamare''}}, acc. {{lang|la|''flammam''}}, {{lang|la|''plenum''}} > Lad. {{lang|lad|''lyamar''}}, {{lang|lad|''flama''}}, {{lang|lad|''pleno''}}; Sp. {{lang|es|''llamar''}}, {{lang|es|''llama''}}, {{lang|es|''lleno''}} (Spanish also has the forms {{lang|es|''clamar''}}, {{lang|es|''flama''}}, {{lang|es|''pleno''}}); Port. {{lang|pt|''chamar''}}, {{lang|pt|''chama''}}, {{lang|pt|''cheio''}}; Gal. {{lang|gl|''chamar''}}, {{lang|gl|''chama''}}, {{lang|gl|''cheo''}}.
* Lat. acc. {{lang|la|''octo''}}, {{lang|la|''noctem''}}, {{lang|la|''multum''}} > Lad. {{lang|lad|''ocho''}}, {{lang|lad|''noche''}}, {{lang|lad|''muncho''}}; Sp. {{lang|es|''ocho''}}, {{lang|es|''noche''}}, {{lang|es|''mucho''}}; Port. {{lang|pt|''oito''}}, {{lang|pt|''noite''}}, {{lang|pt|''muito''}}; Gal. {{lang|gl|''oito''}}, {{lang|gl|''noite''}}, {{lang|gl|''moito''}}.
==Geographic distribution==
Spanish is one of the official languages of the [[European Union]], the [[Organization of American States]], the [[Organization of Ibero-American States]], the [[United Nations]], and the [[Union of South American Nations]].
===Europe===
Spanish is an official language of Spain, the country for which it is named and from which it originated. It is also spoken in [[Gibraltar]], though English is the official language. Likewise, it is spoken in [[Andorra]] though [[Catalan language|Catalan]] is the official language. It is also spoken by small communities in other European countries, such as the [[United Kingdom]], [[France]], and [[Germany]]. Spanish is an official language of the [[European Union]]. In Switzerland, Spanish is the [[mother tongue]] of 1.7% of the population, representing the first minority after the 4 official languages of the country.
===The Americas ===
====Latin America====
Most Spanish speakers are in [[Latin America]]; of the countries with the most Spanish speakers, only [[Spain]] is outside the [[Americas]]. [[Mexico]] has more native speakers than any other country. Nationally, Spanish is the official language of [[Argentina]], [[Bolivia]] (co-official [[Quechua]] and [[Aymara language|Aymara]]), [[Chile]], [[Colombia]], [[Costa Rica]], [[Cuba]], [[Dominican Republic]], [[Ecuador]], [[El Salvador]], [[Guatemala]], [[Honduras]], [[Mexico]], [[Nicaragua]], [[Panama]], [[Paraguay]] (co-official [[Guarani language|Guaraní]]), [[Peru]] (co-official [[Quechua]] and, in some regions, [[Aymara language|Aymara]]), [[Uruguay]], and [[Venezuela]]. Spanish is also the official language (co-official with [[English language|English]]) in the U.S. commonwealth of [[Puerto Rico]].
Spanish has no official recognition in the former [[British overseas territories|British colony]] of [[Belize]]; however, per the 2000 census, it is spoken by 43% of the population. Mainly, it is spoken by Hispanic descendants who remained in the region since the 17th century; however, English is the official language.
Spain colonized [[Trinidad and Tobago]] first in [[1498]], leaving the [[Carib]] people the Spanish language. Also the [[Cocoa Panyol]]s, laborers from Venezuela, took their culture and language with them; they are credited with the music of "[[Parang]]" ("[[Parranda]]") on the island. Because of Trinidad's location on the South American coast, the country is much influenced by its Spanish-speaking neighbors. A recent census shows that more than 1,500 inhabitants speak Spanish. In 2004, the government launched the ''Spanish as a First Foreign Language'' (SAFFL) initiative in March 2005. Government regulations require Spanish to be taught, beginning in primary school, while thirty percent of public employees are to be linguistically competent within five years. The government also announced that Spanish will be the country's second official language by [[2020]], beside English.
Spanish is important in [[Brazil]] because of its proximity to and increased trade with its Spanish-speaking neighbors; for example, as a member of the [[Mercosur]] trading bloc. In 2005, the [[National Congress of Brazil]] approved a bill, signed into law by the [[President of Brazil|President]], making Spanish available as a foreign language in secondary schools. In many border towns and villages (especially on the Uruguayan-Brazilian border), a [[mixed language]] known as [[Riverense Portuñol|Portuñol]] is spoken.
====United States====
In the 2006 census, 44.3 million people in the U.S. population were [[Hispanic]] or [[Latino]] by origin; 34 million people, 12.2 percent of the population older than 5 years, speak Spanish at home. Spanish has a [[Spanish in the United States|long history in the United States]] (many south-western states were part of Mexico and Spain), and it has recently been revitalized by heavy immigration from Latin America. Spanish is the most widely taught foreign language in the country. Although the United States has no formally designated "official languages," Spanish is formally recognized at the state level beside English in the U.S. state of [[New Mexico]], where 30 per cent of the population speak it. It also has strong influence in metropolitan areas such as Los Angeles, Miami and New York City. Spanish is the dominant spoken language in [[Puerto Rico]], a U.S. territory. In total, the U.S. has the world's fifth-largest Spanish-speaking population.
===Asia===
Spanish was an official language of the [[Philippines]] but was never spoken by a majority of the population. Movements for most of the masses to learn the language were started but were stopped by the friars. Its importance fell in the first half of the 20th century following the U.S. occupation and administration of the islands. The introduction of the English language in the Philippine government system put an end to the use of Spanish as the official language. The language lost its official status in 1973 during the [[Ferdinand Marcos]] administration.
Spanish is spoken mainly by small communities of Filipino-born Spaniards, Latin Americans, and Filipino [[mestizo]]s (mixed race), descendants of the early colonial Spanish settlers. Throughout the 20th century, the Spanish language declined in importance compared to English and [[Tagalog language|Tagalog]]. According to the 1990 Philippine census, there were 2,658 native speakers of Spanish. No figures were provided during the 1995 and 2000 censuses; however, figures for 2000 did specify there were over 600,000 native speakers of [[Chavacano language|Chavacano]], a Spanish-based [[Creole language|creole]] language spoken in [[Cavite]] and [[Zamboanga]]. Some other sources put the number of Spanish speakers in the Philippines at around two to three million; however, these sources are disputed. Tagalog has about 4,000 words adopted from Spanish, and Visayan and other Philippine languages have around 6,000. Today Spanish is offered as a foreign language in Philippine schools and universities.
===Africa===
In Africa, Spanish is official in the UN-recognised but Moroccan-occupied [[Western Sahara]] (co-official [[Arabic language|Arabic]]) and [[Equatorial Guinea]] (co-official [[French language|French]] and [[Portuguese language|Portuguese]]). Today, nearly 200,000 refugee Sahrawis are able to read and write in Spanish, and several thousand have received [[university]] education in foreign countries as part of aid packages (mainly [[Cuba]] and [[Spain]]). In Equatorial Guinea, Spanish is the predominant language when counting native and non-native speakers (around 500,000 people), while [[Fang language|Fang]] is the most spoken language by number of native speakers. It is also spoken in the Spanish cities in [[Plazas de soberanía|continental North Africa]] ([[Ceuta]] and [[Melilla]]) and in the autonomous community of [[Canary Islands]] (143,000 and 1,995,833 people, respectively). Within Northern Morocco, a former [[History of Morocco#European influence|Franco-Spanish protectorate]] that is also geographically close to Spain, approximately 20,000 people speak Spanish. It is spoken by some communities of [[Angola]], because of the Cuban influence from the [[Cold War]], and in [[Nigeria]] by the descendants of [[Afro-Cuban]] ex-slaves. In [[Côte d'Ivoire]] and [[Senegal]], Spanish can be learned as a second foreign language in the public education system. In 2008, [[Cervantes Institute]] centers will be opened in [[Lagos]] and [[Johannesburg]], the first in [[Sub-Saharan Africa]].
===Oceania===
Among the countries and territories in [[Oceania]], Spanish is also spoken in [[Easter Island]], a territorial possession of Chile. According to the 2001 census, there are approximately 95,000 speakers of Spanish in Australia, 44,000 of whom live in Greater Sydney, where the older [[:Category: Australians of Mexican descent|Mexican]], [[:Category:Australians of Colombian descent|Colombian]], and [[:Category: Australians of Spanish descent|Spanish]] populations and newer [[:Category:Australians of Argentine descent|Argentine]], Salvadoran and [[:Category:Australians of Uruguayan descent|Uruguayan]] communities live.
The island nations of [[Guam]], [[Palau]], [[Northern Marianas]], [[Marshall Islands]] and [[Federated States of Micronesia]] all once had Spanish speakers, since the [[Marianas Islands|Marianas]] and [[Caroline Islands]] were Spanish colonial possessions until the late 19th century (see [[Spanish-American War]]), but Spanish has since been forgotten. It now exists only as an influence on the local native languages and is also spoken by [[Hispanics in the United States|Hispanic American]] resident populations.
==Dialectal variation==
There are important variations among the regions of Spain and throughout Spanish-speaking America. In countries in Hispanophone America, it is preferable to use the word ''castellano'' to distinguish their version of the language from that of Spain, thus asserting their autonomy and national identity. In Spain the Castilian dialect's pronunciation is commonly regarded as the national standard, although a use of slightly different pronouns called [[Loísmo|{{lang|es|''laísmo''}}]] of this dialect is deprecated. More accurately, for nearly everyone in Spain, "standard Spanish" means "pronouncing everything exactly as it is written," an ideal which does not correspond to any real dialect, though the northern dialects are the closest to it. In practice, the standard way of speaking Spanish in the media is "written Spanish" for formal speech, "Madrid dialect" (one of the transitional variants between Castilian and Andalusian) for informal speech.
===Voseo===
Spanish has three [[grammatical person|second-person]] [[grammatical number|singular]] [[pronoun]]s: {{lang|es|''tú''}}, {{lang|es|''usted''}}, and in some parts of Latin America, {{lang|es|''vos''}} (the use of this pronoun and/or its verb forms is called ''voseo''). In those regions where it is used, generally speaking, {{lang|es|''tú''}} and {{lang|es|''vos''}} are informal and used with friends; in other countries, {{lang|es|''vos''}} is considered an archaic form. {{lang|es|''Usted''}} is universally regarded as the formal address (derived from {{lang|es|''vuestra merced''}}, "your grace"), and is used as a mark of respect, as when addressing one's elders or strangers.
{{lang|es|''Vos''}} is used extensively as the primary spoken form of the second-person singular pronoun, although with wide differences in social consideration, in many countries of [[Latin America]], including [[Argentina]], [[Chile]], [[Costa Rica]], the central mountain region of [[Ecuador]], the State of [[Chiapas]] in [[Mexico]], [[El Salvador]], [[Guatemala]], [[Honduras]], [[Nicaragua]], [[Paraguay]], [[Uruguay]], the [[Paisa region]] and Caleños of [[Colombia]] and the [[States]] of [[Zulia]] and Trujillo in [[Venezuela]]. There are some differences in the verbal endings for ''vos'' in each country. In Argentina, Uruguay, and increasingly in Paraguay and some Central American countries, it is also the standard form used in the [[mass media|media]], but the media in other countries with {{lang|es|''voseo''}} generally continue to use {{lang|es|''usted''}} or {{lang|es|''tú''}} except in advertisements, for instance. {{lang|es|''Vos''}} may also be used regionally in other countries. Depending on country or region, usage may be considered standard or (by better educated speakers) to be unrefined. Interpersonal situations in which the use of ''vos'' is acceptable may also differ considerably between regions.
===Ustedes===
Spanish forms also differ regarding second-person plural pronouns. The Spanish dialects of Latin America have only one form of the second-person plural for daily use, {{lang|es|''ustedes''}} (formal or familiar, as the case may be, though {{lang|es|''vosotros''}} non-formal usage can sometimes appear in poetry and rhetorical or literary style). In Spain there are two forms — {{lang|es|''ustedes''}} (formal) and {{lang|es|''vosotros''}} (familiar). The pronoun {{lang|es|''vosotros''}} is the plural form of {{lang|es|''tú''}} in most of Spain, but in the Americas (and certain southern Spanish cities such as [[Cádiz]] or [[Seville]], and in the [[Canary Islands]]) it is replaced with {{lang|es|''ustedes''}}. It is notable that the use of {{lang|es|''ustedes''}} for the informal plural "you" in southern Spain does not follow the usual rule for pronoun-verb [[agreement (linguistics)|agreement]]; e.g., while the formal form for "you go", {{lang|es|''ustedes van''}}, uses the third-person plural form of the verb, in Cádiz or Seville the informal form is constructed as {{lang|es|''ustedes vais''}}, using the second-person plural of the verb. In the Canary Islands, though, the usual pronoun-verb agreement is preserved in most cases.
Some words can be different, even embarrassingly so, in different Hispanophone countries. Most Spanish speakers can recognize other Spanish forms, even in places where they are not commonly used, but Spaniards generally do not recognise specifically American usages. For example, Spanish ''mantequilla'', ''aguacate'' and ''albaricoque'' (respectively, "butter", "avocado", "apricot") correspond to ''manteca'', ''palta'', and ''damasco'', respectively, in Argentina, Chile and Uruguay. The everyday Spanish words ''coger'' (to catch, get, or pick up), ''pisar'' (to step on) and ''concha'' (seashell) are considered extremely rude in parts of Latin America, where the meaning of ''coger'' and ''pisar'' is also "to have sex" and ''concha'' means "vulva". The Puerto Rican word for "bobby pin" (''pinche'') is an obscenity in Mexico, and in [[Nicaragua]] simply means "stingy". Other examples include ''[[taco]]'', which means "swearword" in Spain but is known to the rest of the world as a Mexican dish. ''Pija'' in many countries of Latin America is an obscene slang word for "penis", while in [[Spain]] the word also signifies "posh girl" or "snobby". ''Coche'', which means "car" in Spain, for the vast majority of Spanish-speakers actually means "baby-stroller", in Guatemala it means "pig", while ''carro'' means "car" in some Latin American countries and "cart" in others, as well as in Spain.
The {{lang|es|[[Real Academia Española]]}} (Royal Spanish Academy), together with the 21 other national ones (see [[Association of Spanish Language Academies]]), exercises a standardizing influence through its publication of dictionaries and widely respected grammar and style guides. Due to this influence and for other sociohistorical reasons, a standardized form of the language ([[Standard Spanish]]) is widely acknowledged for use in literature, academic contexts and the media.
==Writing system==
Spanish is written using the [[Latin alphabet]], with the addition of the character ''[[ñ]]'' (''eñe'', representing the phoneme {{IPA|/ɲ/}}, a letter distinct from ''n'', although typographically composed of an ''n'' with a [[tilde]]) and the [[digraph (orthography)|digraph]]s ''ch'' ({{lang|es|''che''}}, representing the phoneme {{IPA|/tʃ/}}) and ''ll'' ({{lang|es|''elle''}}, representing the phoneme {{IPA|/ʎ/}}). However, the digraph ''rr'' ({{lang|es|''erre fuerte''}}, "strong ''r''", {{lang|es|''erre doble''}}, "double ''r''", or simply {{lang|es|''erre''}}), which also represents a distinct phoneme {{IPA|/r/}}, is not similarly regarded as a single letter. Since 1994, the digraphs ''ch'' and ''ll'' are to be treated as letter pairs for [[collation]] purposes, though they remain a part of the alphabet. Words with ''ch'' are now alphabetically sorted between those with ''ce'' and ''ci'', instead of following ''cz'' as they used to, and similarly for ''ll''.
Thus, the Spanish alphabet has the following 29 letters:
:a, b, c, ch, d, e, f, g, h, i, j, k, l, ll, m, n, ñ, o, p, q, r, s, t, u, v, w, x, y, z.
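The effect of the 1994 collation change can be illustrated with a small sketch. It is only an illustration under stated assumptions: the function name <code>sort_key</code> and the sample word list are hypothetical, accented vowels are assumed to sort with their plain counterparts, and real software would normally rely on a locale-aware collation library instead.

```python
# Illustrative sketch only: names and the sample word list are hypothetical.
# The alphabet is the 29-letter list given in the article.
TRADITIONAL = ["a", "b", "c", "ch", "d", "e", "f", "g", "h", "i", "j", "k",
               "l", "ll", "m", "n", "ñ", "o", "p", "q", "r", "s", "t", "u",
               "v", "w", "x", "y", "z"]
# Since 1994, ch and ll are collated as ordinary letter pairs.
MODERN = [letter for letter in TRADITIONAL if letter not in ("ch", "ll")]

# Assumption of this sketch: accented vowels sort like plain vowels.
PLAIN = str.maketrans("áéíóúü", "aeiouu")


def sort_key(word, digraphs_as_letters):
    """Return a list of letter ranks usable as a sort key."""
    alphabet = TRADITIONAL if digraphs_as_letters else MODERN
    rank = {letter: i for i, letter in enumerate(alphabet)}
    text = word.lower().translate(PLAIN)
    key, i = [], 0
    while i < len(text):
        pair = text[i:i + 2]
        if digraphs_as_letters and pair in ("ch", "ll"):
            key.append(rank[pair])  # digraph counted as one letter
            i += 2
        else:
            # Unknown characters are placed after every letter.
            key.append(rank.get(text[i], len(alphabet)))
            i += 1
    return key


words = ["cuna", "chico", "cine", "cena", "luz", "llama", "lobo"]
old_order = sorted(words, key=lambda w: sort_key(w, True))
new_order = sorted(words, key=lambda w: sort_key(w, False))
print(old_order)
print(new_order)
```

Under the older convention ''chico'' follows every other word beginning with ''c'', and ''llama'' follows every other word beginning with ''l''; under the current one ''chico'' falls between ''cena'' and ''cine''.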
With the exclusion of a very small number of regional terms such as ''México'' (see [[Toponymy of Mexico]]) and some neologisms like ''software'', pronunciation can be entirely determined from spelling. A typical Spanish word is stressed on the [[syllable]] before the last if it ends with a vowel (not including ''y'') or with a vowel followed by ''n'' or ''s''; it is stressed on the last syllable otherwise. Exceptions to this rule are indicated by placing an [[acute accent]] on the [[stress (linguistics)|stressed vowel]].
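The default stress rule just described can be sketched as a short function. This is a minimal illustration, not a full treatment of Spanish orthography: the name <code>stress_position</code> is hypothetical, the function does not syllabify the word, and it ignores monosyllables and the regional exceptions mentioned above.

```python
# Illustrative sketch only: stress_position is a hypothetical helper name.
PLAIN_VOWELS = set("aeiou")      # "y" is deliberately not counted as a vowel
ACCENTED_VOWELS = set("áéíóú")


def stress_position(word):
    """Classify where the stress falls according to the spelling rule."""
    w = word.lower()
    # A written acute accent marks an exception and wins over the default.
    if any(ch in ACCENTED_VOWELS for ch in w):
        return "marked by accent"
    ends_in_vowel = w[-1] in PLAIN_VOWELS
    ends_in_vowel_plus_n_or_s = (
        len(w) >= 2 and w[-1] in "ns" and w[-2] in PLAIN_VOWELS
    )
    if ends_in_vowel or ends_in_vowel_plus_n_or_s:
        return "penultimate"
    return "final"


for example in ["casa", "hablan", "libros", "hablar", "ciudad", "estoy", "canción"]:
    print(example, stress_position(example))
```

Here ''casa'', ''hablan'' and ''libros'' are classified as stressed on the syllable before the last, ''hablar'', ''ciudad'' and ''estoy'' as stressed on the last syllable, and ''canción'' as an exception marked by its accent.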
The acute accent is used, in addition, to distinguish between certain [[homophone]]s, especially when one of them is a stressed word and the other one is a [[clitic]]: compare {{lang|es|''el''}} ("the", masculine singular definite article) with {{lang|es|''él''}} ("he" or "it"), or {{lang|es|''te''}} ("you", object pronoun), {{lang|es|''de''}} (preposition "of" or "from"), and {{lang|es|''se''}} (reflexive pronoun) with {{lang|es|''té''}} ("tea"), {{lang|es|''dé''}} ("give") and {{lang|es|''sé''}} ("I know", or imperative "be").
The interrogative pronouns ({{lang|es|''qué''}}, {{lang|es|''cuál''}}, {{lang|es|''dónde''}}, {{lang|es|''quién''}}, etc.) also receive accents in direct or indirect questions, and some demonstratives ({{lang|es|''ése''}}, {{lang|es|''éste''}}, {{lang|es|''aquél''}}, etc.) must be accented when used as pronouns. The conjunction {{lang|es|''o''}} ("or") is written with an accent between numerals so as not to be confused with a zero: e.g., {{lang|es|''10 ó 20''}} should be read as {{lang|es|''diez o veinte''}} rather than {{lang|es|''diez mil veinte''}} ("10,020"). Accent marks are frequently omitted in capital letters (a widespread practice in the early days of computers where only lowercase vowels were available with accents), although the [[Real Academia Española|RAE]] advises against this.
When ''u'' is written between ''g'' and a front vowel (''e'' or ''i'') and should be pronounced, it is written with a [[diaeresis (diacritic)|diaeresis]] (''ü'') to indicate that it is not silent as it normally would be (e.g., ''cigüeña'', "stork", is pronounced {{IPA|/θiˈɣweɲa/}}; if it were written ''cigueña'', it would be pronounced {{IPA|/θiˈɣeɲa/}}).
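The diaeresis convention can be expressed as a simple pattern check. The sketch below is an assumption-laden illustration: <code>u_after_g</code> is a hypothetical name, and the function covers only the case described here (''u'' between ''g'' and a front vowel), returning nothing for any other spelling.

```python
import re

# Illustrative sketch only: u_after_g is a hypothetical helper name.


def u_after_g(word):
    """Report whether a u between g and e/i is pronounced or silent.

    Returns None when the word has no g + u + front vowel sequence.
    """
    w = word.lower()
    if re.search(r"gü[eiéí]", w):
        return "pronounced"   # the diaeresis marks the u as sounded
    if re.search(r"gu[eiéí]", w):
        return "silent"       # a plain u here is not sounded
    return None


for example in ["cigüeña", "pingüino", "guerra", "guitarra", "gato"]:
    print(example, u_after_g(example))
```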
Interrogative and exclamatory clauses are introduced with [[Inverted question and exclamation marks|inverted question ( ¿ ) and exclamation ( ¡ ) marks]].
==Sounds==
The phonemic inventory listed in the following table includes [[phoneme]]s that are preserved only in some dialects, other dialects having merged them (such as ''[[yeísmo]]''); these are marked with an asterisk (*). Sounds in parentheses are [[allophone]]s.
By the 16th century, the consonant system of Spanish underwent the following important changes that differentiated it from [[Iberian Romance languages|neighboring Romance languages]] such as [[Portuguese language|Portuguese]] and [[Catalan language|Catalan]]:
*Initial {{IPA|/f/}}, when it had evolved into a vacillating {{IPA|/h/}}, was lost in most words (although this etymological ''h-'' is preserved in spelling and in some Andalusian dialects is still aspirated).
*The [[bilabial approximant]] {{IPA|/β̞/}} (which was written ''u'' or ''v'') merged with the bilabial occlusive {{IPA|/b/}} (written ''b''). There is no difference between the pronunciation of orthographic ''b'' and ''v'' in contemporary Spanish, excepting emphatic pronunciations that cannot be considered standard or natural.
*The [[voiced alveolar fricative]] {{IPA|/z/}} which existed as a separate phoneme in medieval Spanish merged with its voiceless counterpart {{IPA|/s/}}. The phoneme which resulted from this merger is currently spelled ''s''.
*The [[voiced postalveolar fricative]] {{IPA|/ʒ/}} merged with its voiceless counterpart {{IPA|/ʃ/}}, which evolved into the modern velar sound {{IPA|/x/}} by the 17th century, now written with ''j'', or ''g'' before ''e, i''. Nevertheless, in most parts of Argentina and in Uruguay, ''y'' and ''ll'' have both evolved to {{IPA|/ʒ/}} or {{IPA|/ʃ/}}.
*The [[voiced alveolar affricate]] {{IPA|/dz/}} merged with its voiceless counterpart {{IPA|/ts/}}, which then developed into the interdental {{IPA|/θ/}}, now written ''z'', or ''c'' before ''e, i''. But in [[Andalusia]], the [[Canary Islands]] and the Americas this sound merged with {{IPA|/s/}} as well. See ''[[Ceceo]]'', for further information.
The consonant system of Medieval Spanish has been better preserved in [[Ladino language|Ladino]] and in Portuguese, neither of which underwent these shifts.
===Lexical stress===
Spanish is a [[syllable-timed language]], so each syllable has the same duration regardless of stress. Stress most often occurs on any of the last three syllables of a word, with some rare exceptions at the fourth last. The ''tendencies'' of stress assignment are as follows:
* In words ending in vowels and {{IPA|/s/}}, stress most often falls on the penultimate syllable.
* In words ending in all other consonants, the stress more often falls on the ultimate syllable.
* Preantepenultimate stress occurs rarely and only in words like ''guardándoselos'' ('saving them for him/her') where a clitic follows certain verbal forms.
In addition to the many exceptions to these tendencies, there are numerous [[minimal pair]]s which contrast solely on stress. For example, ''sabana'', with penultimate stress, means 'savannah' while ''{{lang|es|sábana}}'', with antepenultimate stress, means 'sheet'; ''{{lang|es|límite}}'' ('boundary'), ''{{lang|es|limite}}'' ('[that] he/she limits') and ''{{lang|es|limité}}'' ('I limited') also contrast solely on stress.
Phonological stress may be marked orthographically with an [[acute accent]] (''ácido'', ''distinción'', etc). This is done according to the mandatory stress rules of [[Spanish orthography]] which are similar to the tendencies above (differing with words like ''distinción'') and are defined so as to unequivocally indicate where the stress lies in a given written word. An acute accent may also be used to differentiate homophones (such as ''[[wikt:té#Spanish|té]]'' for 'tea' and ''[[wikt:te#Spanish|te]]'' for the object pronoun 'you').
An amusing example of the significance of intonation in Spanish is the phrase ''{{lang|es|¿Cómo "cómo como"? ¡Como como como!}}'' ("What do you mean / 'how / do I eat'? / I eat / the way / I eat!").
==Grammar==
Spanish is a relatively [[inflected]] language, with a two-[[Grammatical gender|gender]] system and about fifty [[Grammatical conjugation|conjugated]] forms per [[verb]], but limited inflection of [[noun]]s, [[adjective]]s, and [[determiner]]s. (For a detailed overview of verbs, see [[Spanish verbs]] and [[Spanish irregular verbs]].)
It is [[Branching (linguistics)|right-branching]], uses [[preposition]]s, and usually, though not always, places [[adjective]]s after [[noun]]s. Its [[syntax]] is generally [[Subject Verb Object]], though variations are common. It is a [[pro-drop language]] (allows the deletion of pronouns when pragmatically unnecessary) and [[verb framing|verb-framed]].
== Samples ==
Speech recognition
'''Speech recognition''' (also known as '''automatic speech recognition''' or '''computer speech recognition''') converts spoken words to machine-readable input (for example, to keypresses, using the binary code for a string of [[Character (computing)|character]] codes). The term [[speaker recognition|voice recognition]] may also be used to refer to speech recognition, but more precisely refers to '''speaker recognition''', which attempts to identify the person speaking, as opposed to what is being said.
Speech recognition applications include voice dialing (e.g., "Call home"), call routing (e.g., "I would like to make a collect call"), [[domotic]] appliance control and content-based spoken audio search (e.g., find a podcast where particular words were spoken), simple data entry (e.g., entering a credit card number), preparation of structured documents (e.g., a radiology report), speech-to-text processing (e.g., [[word processor]]s or [[email]]s), and in aircraft [[cockpit]]s (usually termed [[Direct Voice Input]]).
==History==
One of the most notable domains for the commercial application of speech recognition in the United States has been health care and in particular the work of the [[medical transcription]]ist (MT). According to industry experts, at its inception, speech recognition (SR) was sold as a way to completely eliminate transcription rather than make the transcription process more efficient, hence it was not accepted. It was also the case that SR at that time was often technically deficient. Additionally, to be used effectively, it required changes to the ways physicians worked and documented clinical encounters, which many if not all were reluctant to do. The biggest limitation to speech recognition automating transcription, however, is seen as the software. The nature of narrative dictation is highly interpretive and often requires judgment that may be provided by a real human but not yet by an automated system. Another limitation has been the extensive amount of time required by the user and/or system provider to train the software.
A distinction in ASR is often made between "artificial syntax systems" which are usually domain-specific and "natural language processing" which is usually language-specific. Each of these types of application presents its own particular goals and challenges.
==Applications==
===Health care===
In the [[health care]] domain, even in the wake of improving speech recognition technologies, medical transcriptionists (MTs) have not yet become obsolete. Many experts in the field anticipate that with increased use of speech recognition technology, the services provided may be redistributed rather than replaced.
Speech recognition can be implemented in front-end or back-end of the medical documentation process.
Front-End SR is where the provider dictates into a speech-recognition engine, the recognized words are displayed right after they are spoken, and the dictator is responsible for editing and signing off on the document. It never goes through an MT/editor.
Back-End SR or Deferred SR is where the provider dictates into a digital dictation system and the voice is routed through a speech-recognition machine. The recognized draft document is then routed, along with the original voice file, to the MT/editor, who edits the draft and finalizes the report. Deferred SR is currently widely used in the industry.
Many [[Electronic Medical Records]] (EMR) applications can be more effective and may be performed more easily when deployed in conjunction with a speech-recognition engine. Searches, queries, and form filling may all be faster to perform by voice than by using a keyboard.
===Military===
====High-performance fighter aircraft====
Substantial efforts have been devoted in the last decade to the test and evaluation of speech recognition in fighter aircraft. Of particular note are the U.S. program in speech recognition for the Advanced Fighter Technology Integration (AFTI)/[[F-16]] aircraft ([[F-16 VISTA]]), the program in France on installing speech recognition systems on [[Mirage (aircraft)|Mirage]] aircraft, and programs in the UK dealing with a variety of aircraft platforms. In these programs, speech recognizers have been operated successfully in fighter aircraft with applications including: setting radio frequencies, commanding an autopilot system, setting steer-point coordinates and weapons release parameters, and controlling flight displays. Generally, only very limited, constrained vocabularies have been used successfully, and a major effort has been devoted to integration of the speech recognizer with the avionics system.
Some important conclusions from the work were as follows:
#Speech recognition has definite potential for reducing pilot workload, but this potential was not realized consistently.
#Achievement of very high recognition accuracy (95% or more) was the most critical factor for making the speech recognition system useful — with lower recognition rates, pilots would not use the system.
#More natural vocabulary and grammar, and shorter training times would be useful, but only if very high recognition rates could be maintained.
Laboratory research in robust speech recognition for military environments has produced promising results which, if extendable to the cockpit, should improve the utility of speech recognition in high-performance aircraft.
Working with Swedish pilots flying in the [[JAS-39]] Gripen cockpit, Englund (2004) found recognition deteriorated with increasing G-loads. It was also concluded that adaptation greatly improved the results in all cases and introducing models for breathing was shown to improve recognition scores significantly. Contrary to what might be expected, no effects of the broken English of the speakers were found. It was evident that spontaneous speech caused problems for the recognizer, as could be expected. A restricted vocabulary, and above all, a proper syntax, could thus be expected to improve recognition accuracy substantially.
The [[Eurofighter Typhoon]] currently in service with the UK [[RAF]] employs a speaker-dependent system, i.e. it requires each pilot to create a template. The system is not used for any safety critical or weapon critical tasks, such as weapon release or lowering of the undercarriage, but is used for a wide range of other [[cockpit]] functions. Voice commands are confirmed by visual and/or aural feedback. The system is seen as a major design feature in the reduction of pilot [[workload]], and even allows the pilot to assign targets to himself with two simple voice commands or to any of his wingmen with only five commands.
====Helicopters====
The problems of achieving high recognition accuracy under stress and noise pertain strongly to the helicopter environment as well as to the fighter environment. The acoustic noise problem is actually more severe in the helicopter environment, not only because of the high noise levels but also because the helicopter pilot generally does not wear a facemask, which would reduce acoustic noise in the microphone. Substantial test and evaluation programs have been carried out in the past decade on speech recognition system applications in helicopters, notably by the U.S. Army Avionics Research and Development Activity (AVRADA) and by the Royal Aerospace Establishment (RAE) in the UK. Work in France has included speech recognition in the Puma helicopter. There has also been much useful work in Canada. Results have been encouraging, and voice applications have included: control of communication radios; setting of navigation systems; and control of an automated target handover system.
As in fighter applications, the overriding issue for voice in helicopters is the impact on pilot effectiveness. Encouraging results are reported for the AVRADA tests, although these represent only a feasibility demonstration in a test environment. Much remains to be done, both in speech recognition itself and in overall speech technology, in order to consistently achieve performance improvements in operational settings.
====Battle management====
Battle management command centres generally require rapid access to and control of large, rapidly changing information databases. Commanders and system operators need to query these databases as conveniently as possible, in an eyes-busy environment where much of the information is presented in a display format. Human machine interaction by voice has the potential to be very useful in these environments. A number of efforts have been undertaken to interface commercially available isolated-word recognizers into battle management environments. In one feasibility study, speech recognition equipment was tested in conjunction with an integrated information display for naval battle management applications. Users were very optimistic about the potential of the system, although capabilities were limited.
Speech understanding programs sponsored by the Defense Advanced Research Projects Agency (DARPA) in the U.S. have focused on this problem of natural speech interface. Speech recognition efforts have focused on a database of continuous, large-vocabulary speech which is designed to be representative of the naval resource management task. Significant advances in the state of the art in continuous speech recognition (CSR) have been achieved, and current efforts are focused on integrating speech recognition and natural language processing to allow spoken language interaction with a naval resource management system.
====Training air traffic controllers====
Training for military (or civilian) air traffic controllers (ATC) represents an excellent application for speech recognition systems. Many ATC training systems currently require a person to act as a "pseudo-pilot", engaging in a voice dialog with the trainee controller, which simulates the dialog which the controller would have to conduct with pilots in a real ATC situation.
Speech recognition and synthesis techniques offer the potential to eliminate the need for a person to act as pseudo-pilot, thus reducing training and support personnel. Air controller tasks are also characterized by highly structured speech as the primary output of the controller, hence reducing the difficulty of the speech recognition task.
The U.S. Naval Training Equipment Center has sponsored a number of developments of prototype ATC trainers using speech recognition. Generally, the recognition accuracy falls short of providing graceful interaction between the trainee and the system. However, the prototype training systems have demonstrated a significant potential for voice interaction in these systems, and in other training applications. The U.S. Navy has sponsored a large-scale effort in ATC training systems, where a commercial speech recognition unit was integrated with a complex training system including displays and scenario creation. Although the recognizer was constrained in vocabulary, one of the goals of the training programs was to teach the controllers to speak in a constrained language, using vocabulary specifically designed for the ATC task. Research in France has focussed on the application of speech recognition in ATC training systems, directed at issues both in speech recognition and in application of task-domain grammar constraints.
The USAF, USMC, US Army, and FAA are currently using ATC simulators with speech recognition provided by Adacel Systems Inc (ASI). Adacel's MaxSim software uses speech recognition and synthetic speech to enable the trainee to control aircraft and ground vehicles in the simulation without the need for pseudo-pilots. Adacel's ATC In A Box software provides a synthetic ATC environment for flight simulators. The "real" pilot talks to a virtual controller using speech recognition, and the virtual controller responds with synthetic speech.
===Telephony and other domains===
ASR in the field of telephony is now commonplace and in the field of computer gaming and simulation is becoming more widespread. Despite the high level of integration with word processing in general personal computing, however, ASR in the field of document production has not seen the expected increases in use.
Improvements in mobile processor speeds have made speech-enabled Symbian and Windows Mobile smartphones possible. Current speech-to-text programs are too large and require too much CPU power to be practical for the Pocket PC. Speech is used mostly as part of the user interface, for creating pre-defined or custom speech commands. Leading software vendors in this field are:
*Microsoft Corporation (Microsoft Voice Command)
*Nuance Communications (Nuance Voice Control)
*Vito Technology (VITO Voice2Go)
*Speereo Software (Speereo Voice Translator)
===People with Disabilities===
People with disabilities are another part of the population that benefit from using speech recognition programs. It is especially useful for people who have difficulty with or are unable to use their hands, from mild repetitive stress injuries to involved disabilities that require alternative input for support with accessing the computer. In fact, people who used the keyboard a lot and developed [[Repetitive Strain Injury|RSI]] became an urgent early market for speech recognition. Speech recognition is used in [[deaf]] [[telephony]], such as [[spinvox]] voice-to-text voicemail, [[relay services]], and [[Telecommunications Relay Service#Captioned_telephone|captioned telephone]].
===Further applications===
*Automatic translation
*Automotive speech recognition (e.g., [[Ford Sync]])
*Telematics (e.g. vehicle Navigation Systems)
*Court reporting (Realtime Voice Writing)
*[[Hands-free computing]]: voice command recognition computer [[user interface]]
*[[Home automation]]
*[[Interactive voice response]]
*[[Mobile telephony]], including mobile email
*[[Multimodal interaction]]
*[[Pronunciation]] evaluation in computer-aided language learning applications
*[[Robotics]]
*[[Transcription (linguistics)|Transcription]] (digital speech-to-text).
*Speech-to-Text (Transcription of speech into mobile text messages)
==Performance of speech recognition systems==
The performance of speech recognition systems is usually specified in terms of accuracy and speed. Accuracy is usually rated with [[word error rate]] (WER), whereas speed is measured with the [[real time factor]]. Other measures of accuracy include [[Single Word Error Rate]] (SWER) and [[Command Success Rate]] (CSR).
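As an illustration of these two measures, the following minimal sketch computes WER as the word-level edit distance (substitutions, deletions and insertions) divided by the number of reference words, and the real time factor as processing time divided by audio duration. The function names and the whitespace tokenization are assumptions made for this example; evaluation tools typically also normalize case and punctuation first.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()   # assumption: whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (one row of the Levenshtein table).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, 1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(processing_seconds, audio_seconds):
    """Values below 1.0 mean the recognizer runs faster than real time."""
    return processing_seconds / audio_seconds
```

Because insertions are counted, WER can exceed 1.0 when the hypothesis is much longer than the reference.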
Most speech recognition users would tend to agree that dictation machines can achieve very high performance in controlled conditions. There is some confusion, however, over the interchangeability of the terms "speech recognition" and "dictation".
Commercially available speaker-dependent dictation systems usually require only a short period of training (sometimes also called "enrollment") and may successfully capture continuous speech with a large vocabulary at normal pace with very high accuracy. Most commercial companies claim that recognition software can achieve between 98% and 99% accuracy if operated under optimal conditions. "Optimal conditions" usually assume that users:
* have speech characteristics which match the training data,
* can achieve proper speaker adaptation, and
* work in a clean noise environment (e.g. quiet office or laboratory space).
This explains why some users, especially those whose speech is heavily accented, might achieve recognition rates much lower than expected. Speech recognition in video has become a popular search technology used by several video search companies.
Limited vocabulary systems, requiring no training, can recognize a small number of words (for instance, the ten digits) as spoken by most speakers. Such systems are popular for routing incoming phone calls to their destinations in large organizations.
Both [[Acoustic Model|acoustic modeling]] and [[language model]]ing are important parts of modern statistically-based speech recognition algorithms. Hidden Markov models (HMMs) are widely used in many systems. Language modeling has many other applications such as [[smart keyboard]] and [[document classification]].
===Hidden Markov model (HMM)-based speech recognition===
Modern general-purpose speech recognition systems are generally based on [[Hidden Markov Model|HMMs]]. These are statistical models which output a sequence of symbols or quantities.
One possible reason why HMMs are used in speech recognition is that a speech signal can be viewed as a piecewise stationary or short-time stationary signal. That is, over a short interval on the order of 10 milliseconds, speech can be approximated as a [[stationary process]]. Speech can thus be thought of as a [[Markov model]] for many stochastic processes.
Another reason why HMMs are popular is that they can be trained automatically and are simple and computationally feasible to use. In speech recognition, the hidden Markov model would output a sequence of ''n''-dimensional real-valued vectors (with ''n'' being a small integer, such as 10), outputting one of these every 10 milliseconds. The vectors would consist of [[cepstrum|cepstral]] coefficients, which are obtained by taking a [[Fourier transform]] of a short time window of speech and decorrelating the spectrum using a [[cosine transform]], then taking the first (most significant) coefficients. The hidden Markov model will tend to have in each state a statistical distribution that is a mixture of diagonal covariance Gaussians, which will give a likelihood for each observed vector. Each word, or (for more general speech recognition systems) each [[phoneme]], will have a different output distribution; a hidden Markov model for a sequence of words or phonemes is made by concatenating the individual trained hidden Markov models for the separate words and phonemes.
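The feature computation and the per-state likelihood described above can be sketched as follows. This is a simplified illustration, not the front end of any particular recognizer: the Hamming window, the naive DFT, the small floor added before the logarithm, and all function names are assumptions made for the example, and practical systems usually apply a mel-scaled filter bank before the cosine transform.

```python
import cmath
import math


def cepstral_coefficients(frame, num_coeffs=10):
    """Window a short frame, take the log magnitude spectrum, then a DCT-II."""
    n = len(frame)
    # Hamming window to reduce spectral leakage (assumed choice of window).
    windowed = [s * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, s in enumerate(frame)]
    # Naive DFT over the non-redundant bins 0..n/2; fine for an 80-sample frame.
    log_spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        log_spectrum.append(math.log(abs(acc) + 1e-10))  # floor avoids log(0)
    # DCT-II decorrelates the log spectrum; keep only the first coefficients.
    m = len(log_spectrum)
    return [sum(log_spectrum[j] * math.cos(math.pi * q * (j + 0.5) / m)
                for j in range(m))
            for q in range(num_coeffs)]


def diag_gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of vector x under a mixture of diagonal-covariance Gaussians."""
    component_logs = []
    for w, mu, var in zip(weights, means, variances):
        log_gauss = -0.5 * sum(math.log(2 * math.pi * v) + (xi - m) ** 2 / v
                               for xi, m, v in zip(x, mu, var))
        component_logs.append(math.log(w) + log_gauss)
    # Log-sum-exp keeps the mixture sum numerically stable.
    peak = max(component_logs)
    return peak + math.log(sum(math.exp(c - peak) for c in component_logs))
```

With an 8 kHz sampling rate (an assumed value), an 80-sample frame corresponds to the 10-millisecond window mentioned above.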
Described above are the core elements of the most common, HMM-based approach to speech recognition. Modern speech recognition systems use various combinations of a number of standard techniques in order to improve results over the basic approach described above. A typical large-vocabulary system would need context dependency for the phonemes (so phonemes with different left and right context have different realizations as HMM states); it would use cepstral normalization to normalize for different speaker and recording conditions; for further speaker normalization it might use vocal tract length normalization (VTLN) for male-female normalization and maximum likelihood linear regression (MLLR) for more general speaker adaptation. The features would have so-called delta and delta-delta coefficients to capture speech dynamics and in addition might use heteroscedastic linear discriminant analysis (HLDA); or might skip the delta and delta-delta coefficients and use splicing and an LDA-based projection followed perhaps by heteroscedastic linear discriminant analysis or a global semitied covariance transform (also known as maximum likelihood linear transform, or MLLT). Many systems use so-called discriminative training techniques which dispense with a purely statistical approach to HMM parameter estimation and instead optimize some classification-related measure of the training data. Examples are maximum [[mutual information]] (MMI), minimum classification error (MCE) and minimum phone error (MPE).
Decoding of the speech (the term for what happens when the system is presented with a new utterance and must compute the most likely source sentence) would probably use the [[Viterbi algorithm]] to find the best path, and here there is a choice between dynamically creating a combination hidden Markov model which includes both the acoustic and language model information, or combining it statically beforehand (the [[finite state transducer]], or FST, approach).
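A minimal sketch of Viterbi decoding over a small HMM is given below, working in log probabilities to avoid underflow. The two states, three observation symbols and all probabilities are hypothetical toy values chosen for the example; a real decoder would search a much larger network that combines acoustic and language model scores.

```python
import math


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return the most likely state path and its log probability."""
    # best[t][s]: log probability of the best path ending in state s at time t.
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    back = [{}]
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev_state = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            scores[s] = (best[-1][prev_state] + log_trans[prev_state][s]
                         + log_emit[s][obs])
            pointers[s] = prev_state
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[-1][last]


def to_log(table):
    """Convert a (possibly nested) dict of probabilities to log probabilities."""
    return {k: to_log(v) if isinstance(v, dict) else math.log(v)
            for k, v in table.items()}


# Hypothetical toy model, for illustration only.
STATES = ["A", "B"]
LOG_START = to_log({"A": 0.6, "B": 0.4})
LOG_TRANS = to_log({"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.4, "B": 0.6}})
LOG_EMIT = to_log({"A": {"x": 0.5, "y": 0.4, "z": 0.1},
                   "B": {"x": 0.1, "y": 0.3, "z": 0.6}})

best_path, best_log_prob = viterbi(["x", "y", "z"], STATES,
                                   LOG_START, LOG_TRANS, LOG_EMIT)
```

For this toy model the best path is A, A, B, with probability 0.6 × 0.5 × 0.7 × 0.4 × 0.3 × 0.6 = 0.01512.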
===Dynamic time warping (DTW)-based speech recognition===
Dynamic time warping is an approach that was historically used for speech recognition but has now largely been displaced by the more successful HMM-based approach.
Dynamic time warping is an algorithm for measuring similarity between two sequences which may vary in time or speed. For instance, similarities in walking patterns would be detected, even if in one video the person was walking slowly and if in another they were walking more quickly, or even if there were accelerations and decelerations during the course of one observation. DTW has been applied to video, audio, and graphics – indeed, any data which can be turned into a linear representation can be analyzed with DTW.
A well known application has been automatic speech recognition, to cope with different speaking speeds. In general, it is a method that allows a computer to find an optimal match between two given sequences (e.g. time series) with certain restrictions, i.e. the sequences are "warped" non-linearly to match each other. This sequence alignment method is often used in the context of hidden Markov models.
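The non-linear alignment can be illustrated with a short dynamic-programming sketch. It assumes one-dimensional numeric sequences and an absolute-difference local distance, both chosen only for the example; in speech recognition the sequence elements would be feature vectors compared with a vector distance, and practical implementations usually add constraints on the warping path.

```python
def dtw_distance(a, b, dist=lambda x, y: abs(x - y)):
    """Cost of the best monotonic alignment between sequences a and b."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j]: cheapest alignment of the first i items of a with the first j of b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = dist(a[i - 1], b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # a advances, b repeats
                                     cost[i][j - 1],      # b advances, a repeats
                                     cost[i - 1][j - 1])  # both advance
    return cost[n][m]
```

For example, `dtw_distance([1, 2, 3], [1, 1, 2, 2, 3, 3])` is zero, because the second sequence is the first one spoken "at half speed", whereas a frame-by-frame comparison of the two would not even be defined for sequences of different lengths.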
==Further information==
Popular speech recognition conferences held each year or two include ICASSP, Eurospeech/ICSLP (now named Interspeech) and the IEEE ASRU. Conferences in the field of [[Natural Language Processing]], such as ACL, NAACL, EMNLP, and HLT, are beginning to include papers on speech processing. Important journals include the [[IEEE]] Transactions on Speech and Audio Processing (now named [[IEEE]] Transactions on Audio, Speech and Language Processing), Computer Speech and Language, and Speech Communication. Books like "Fundamentals of Speech Recognition" by [[Lawrence Rabiner]] can be useful to acquire basic knowledge but may not be fully up to date (1993). Another good source can be "Statistical Methods for Speech Recognition" by Frederick Jelinek which is a more up to date book (1998). Even more up to date is "Computer Speech", by Manfred R. Schroeder, second edition published in 2004. A good insight into the techniques used in the best modern systems can be gained by paying attention to government sponsored evaluations such as those organised by [[DARPA]] (the largest speech recognition-related project ongoing as of 2007 is the GALE project, which involves both speech recognition and translation components).
In terms of freely available resources, the [[HTK (software)|HTK]] book (and the accompanying HTK toolkit) is one place to start to both learn about speech recognition and to start experimenting. Another such resource is [[Carnegie Mellon University]]'s SPHINX toolkit. The AT&T libraries [http://www.research.att.com/projects/mohri/fsm FSM Library], [http://www.research.att.com/projects/mohri/grm GRM library], and [http://www.cs.nyu.edu/~mohri DCD library] are also general software libraries for large-vocabulary speech recognition.
A useful review of the area of robustness in ASR is provided by Junqua and Haton (1995).
Speech synthesis
'''Speech synthesis''' is the artificial production of human [[Speech communication|speech]]. A computer system used for this purpose is called a '''speech synthesizer''', and can be implemented in [[software]] or [[Computer hardware|hardware]]. A '''text-to-speech (TTS)''' system converts normal language text into speech; other systems render [[symbolic linguistic representation]]s like [[phonetic transcription]]s into speech.
Synthesized speech can be created by concatenating pieces of recorded speech that are stored in a [[database]]. Systems differ in the size of the stored speech units; a system that stores [[phone]]s or [[diphone]]s provides the largest output range, but may lack clarity. For specific usage domains, the storage of entire words or sentences allows for high-quality output. Alternatively, a synthesizer can incorporate a model of the [[vocal tract]] and other human voice characteristics to create a completely "synthetic" voice output.
The quality of a speech synthesizer is judged by its similarity to the human voice, and by its ability to be understood. An intelligible text-to-speech program allows people with [[visual impairment]]s or [[reading disability|reading disabilities]] to listen to written works on a home computer. Many computer operating systems have included speech synthesizers since the early 1980s.
== Overview of text processing ==
A text-to-speech system (or "engine") is composed of two parts: a [[front-end]] and a back-end. The front-end has two major tasks. First, it converts raw text containing symbols like numbers and abbreviations into the equivalent of written-out words. This process is often called ''text normalization'', ''pre-processing'', or ''[[tokenization]]''. The front-end then assigns [[phonetic transcription]]s to each word, and divides and marks the text into [[prosody (linguistics)|prosodic units]], like [[phrase]]s, [[clause]]s, and [[sentence (linguistics)|sentence]]s. The process of assigning phonetic transcriptions to words is called ''text-to-phoneme'' or ''[[grapheme]]-to-phoneme'' conversion. Phonetic transcriptions and prosody information together make up the symbolic linguistic representation that is output by the front-end. The back-end—often referred to as the ''synthesizer''—then converts the symbolic linguistic representation into sound.
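The first front-end task, text normalization, can be sketched as below. The abbreviation table, the function names and the restriction to integers below 1000 are assumptions made for this illustration; a real front-end must also resolve ambiguity from context (for example, "St." as "street" or "saint"), which this sketch does not attempt.

```python
import re

# Hypothetical, deliberately tiny abbreviation table.
ABBREVIATIONS = {"Dr.": "doctor", "St.": "street", "etc.": "et cetera"}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n):
    """Spell out an integer in the range 0..999."""
    if not 0 <= n < 1000:
        raise ValueError("only 0..999 is handled in this sketch")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words + (" " + number_to_words(rest) if rest else "")


def normalize(text):
    """Replace known abbreviations and short digit strings with written-out words."""
    out = []
    for token in text.split():
        if token in ABBREVIATIONS:
            out.append(ABBREVIATIONS[token])
        elif re.fullmatch(r"[0-9]{1,3}", token):
            out.append(number_to_words(int(token)))
        else:
            out.append(token)  # left unchanged for later stages
    return " ".join(out)
```

The normalized word sequence would then be passed to grapheme-to-phoneme conversion and prosodic marking, which are outside the scope of this sketch.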
== History ==
Long before [[electronics|electronic]] [[signal processing]] was invented, there were those who tried to build machines to create human speech. Some early legends of the existence of [[Brazen Head|"speaking heads"]] involved [[Pope Silvester II|Gerbert of Aurillac]] (d. 1003 AD), [[Albertus Magnus]] (1198–1280), and [[Roger Bacon]] (1214–1294).
In 1779, the [[Denmark|Danish]] scientist Christian Kratzenstein, working at the [[Russian Academy of Sciences]], built models of the human [[vocal tract]] that could produce the five long [[vowel]] sounds (in [[help:IPA|International Phonetic Alphabet]] notation, they are {{IPA|[aː]}}, {{IPA|[eː]}}, {{IPA|[iː]}}, {{IPA|[oː]}} and {{IPA|[uː]}}). This was followed by the [[bellows]]-operated "acoustic-mechanical speech machine" by [[Wolfgang von Kempelen]] of [[Vienna]], [[Austria]], described in a 1791 paper. This machine added models of the tongue and lips, enabling it to produce [[consonant]]s as well as vowels. In 1837, [[Charles Wheatstone]] produced a "speaking machine" based on von Kempelen's design, and in 1857, M. Faber built the "Euphonia". Wheatstone's design was resurrected in 1923 by Paget.
In the 1930s, [[Bell Labs]] developed the [[Vocoder|VOCODER]], a keyboard-operated electronic speech analyzer and synthesizer that was said to be clearly intelligible. [[Homer Dudley]] refined this device into the VODER, which he exhibited at the [[1939 New York World's Fair]].
The [[Pattern playback]] was built by [[Franklin S. Cooper|Dr. Franklin S. Cooper]] and his colleagues at [[Haskins Laboratories]] in the late 1940s and completed in 1950. There were several different versions of this hardware device but only one currently survives. The machine converts pictures of the acoustic patterns of speech in the form of a spectrogram back into sound. Using this device, [[Alvin Liberman]] and colleagues were able to discover acoustic cues for the perception of [[phonetic]] segments (consonants and vowels).
Early electronic speech synthesizers sounded robotic and were often barely intelligible. However, the quality of synthesized speech has steadily improved, and output from contemporary speech synthesis systems is sometimes indistinguishable from actual human speech.
=== Electronic devices ===
The first computer-based speech synthesis systems were created in the late 1950s, and the first complete text-to-speech system was completed in 1968. In 1961, physicist [[John Larry Kelly, Jr]] and colleague Louis Gerstman used an [[IBM 704]] computer to synthesize speech, an event among the most prominent in the history of [[Bell Labs]]. Kelly's voice recorder synthesizer (vocoder) recreated the song "[[Daisy Bell]]", with musical accompaniment from [[Max Mathews]]. Coincidentally, [[Arthur C. Clarke]] was visiting his friend and colleague John Pierce at the Bell Labs Murray Hill facility. Clarke was so impressed by the demonstration that he used it in the climactic scene of his screenplay for his novel ''[[2001: A Space Odyssey (novel)|2001: A Space Odyssey]]'', where the [[HAL 9000]] computer sings the same song as it is being put to sleep by astronaut [[Dave Bowman]]. Despite the success of purely electronic speech synthesis, research is still being conducted into mechanical speech synthesizers.
== Synthesizer technologies ==
The most important qualities of a speech synthesis system are ''naturalness'' and ''[[Intelligibility]]''. Naturalness describes how closely the output sounds like human speech, while intelligibility is the ease with which the output is understood. The ideal speech synthesizer is both natural and intelligible. Speech synthesis systems usually try to maximize both characteristics.
The two primary technologies for generating synthetic speech waveforms are ''concatenative synthesis'' and ''[[formant]] synthesis''. Each technology has strengths and weaknesses, and the intended uses of a synthesis system will typically determine which approach is used.
=== Concatenative synthesis ===
Concatenative synthesis is based on the [[concatenation]] (or stringing together) of segments of recorded speech. Generally, concatenative synthesis produces the most natural-sounding synthesized speech. However, differences between natural variations in speech and the nature of the automated techniques for segmenting the waveforms sometimes result in audible glitches in the output. There are three main sub-types of concatenative synthesis.
==== Unit selection synthesis ====
Unit selection synthesis uses large [[database]]s of recorded speech. During database creation, each recorded utterance is segmented into some or all of the following: individual [[phone]]s, [[diphone]]s, half-phones, [[syllable]]s, [[morpheme]]s, [[word]]s, [[phrase]]s, and [[Sentence (linguistics)|sentence]]s. Typically, the division into segments is done using a specially modified [[speech recognition|speech recognizer]] set to a "forced alignment" mode with some manual correction afterward, using visual representations such as the [[waveform]] and [[spectrogram]]. An [[index (database)|index]] of the units in the speech database is then created based on the segmentation and acoustic parameters like the [[fundamental frequency]] ([[pitch (music)|pitch]]), duration, position in the syllable, and neighboring phones. At [[runtime]], the desired target utterance is created by determining the best chain of candidate units from the database (unit selection). This process is typically achieved using a specially weighted [[decision tree]].
Unit selection provides the greatest naturalness, because it applies only a small amount of [[digital signal processing]] (DSP) to the recorded speech. DSP often makes recorded speech sound less natural, although some systems use a small amount of signal processing at the point of concatenation to smooth the waveform. The output from the best unit-selection systems is often indistinguishable from real human voices, especially in contexts for which the TTS system has been tuned. However, maximum naturalness typically requires unit-selection speech databases to be very large, in some systems ranging into the [[gigabyte]]s of recorded data, representing dozens of hours of speech. Also, unit selection algorithms have been known to select segments from a place that results in less than ideal synthesis (e.g. minor words become unclear) even when a better choice exists in the database.
==== Diphone synthesis ====
Diphone synthesis uses a minimal speech database containing all the [[diphone]]s (sound-to-sound transitions) occurring in a language. The number of diphones depends on the [[phonotactics]] of the language: for example, Spanish has about 800 diphones, and German about 2500. In diphone synthesis, only one example of each diphone is contained in the speech database. At runtime, the target [[prosody]] of a sentence is superimposed on these minimal units by means of [[digital signal processing]] techniques such as [[linear predictive coding]], [[PSOLA]] or [[MBROLA]].
The quality of the resulting speech is generally worse than that of unit-selection systems, but more natural-sounding than the output of formant synthesizers. Diphone synthesis suffers from the sonic glitches of concatenative synthesis and the robotic-sounding nature of formant synthesis, and has few of the advantages of either approach other than small size. As such, its use in commercial applications is declining, although it continues to be used in research because there are a number of freely available software implementations.
==== Domain-specific synthesis ====
Domain-specific synthesis concatenates prerecorded words and phrases to create complete utterances. It is used in applications where the variety of texts the system will output is limited to a particular domain, like transit schedule announcements or weather reports. The technology is very simple to implement, and has been in commercial use for a long time, in devices like talking clocks and calculators. The level of naturalness of these systems can be very high because the variety of sentence types is limited, and they closely match the prosody and intonation of the original recordings.
Because these systems are limited by the words and phrases in their databases, they are not general-purpose and can only synthesize the combinations of words and phrases with which they have been preprogrammed. The blending of words within naturally spoken language, however, can still cause problems unless the many variations are taken into account. For example, in [[Rhotic and non-rhotic accents|non-rhotic]] dialects of English the "r" in words like "clear" {{IPA|/ˈkliːə/}} is usually only pronounced when the following word begins with a vowel (e.g. "clear out" is realized as {{IPA|/ˌkliːəɹˈɑʊt/}}). Likewise in [[French language|French]], many final consonants become no longer silent if followed by a word that begins with a vowel, an effect called [[Liaison (French)|liaison]]. This [[alternation (linguistics)|alternation]] cannot be reproduced by a simple word-concatenation system, which would require additional complexity to be [[context-sensitive]].
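A talking clock gives a compact illustration of domain-specific synthesis: the program only chooses which prerecorded clips to play and in what order. In the sketch below, the clip identifiers (such as <code>the_time_is</code> and <code>num_3</code>) are hypothetical names for recordings assumed to exist, and playback itself is left out.

```python
def clip_sequence(hour, minute):
    """Return the ordered clip identifiers announcing a 24-hour clock time."""
    if not (0 <= hour < 24 and 0 <= minute < 60):
        raise ValueError("hour must be 0..23 and minute 0..59")
    hour12 = hour % 12 or 12  # 0 and 12 are both announced as "twelve"
    clips = ["the_time_is", f"num_{hour12}"]
    if minute == 0:
        clips.append("oclock")
    elif minute < 10:
        clips += ["oh", f"num_{minute}"]  # e.g. "three oh five"
    else:
        clips.append(f"num_{minute}")     # assumes one recording per number 10..59
    clips.append("pm" if hour >= 12 else "am")
    return clips
```

For instance, `clip_sequence(15, 5)` yields the clips for "the time is three oh five pm". Because each clip is played exactly as recorded, a sketch like this has no way to adjust a word's pronunciation to its neighbours, which is the context-sensitivity limitation noted above.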
=== Formant synthesis ===
[[Formant]] synthesis does not use human speech samples at runtime. Instead, the synthesized speech output is created using an acoustic model. Parameters such as [[fundamental frequency]], [[phonation|voicing]], and [[noise]] levels are varied over time to create a [[waveform]] of artificial speech. This method is sometimes called ''rules-based synthesis''; however, many concatenative systems also have rules-based components.
Many systems based on formant synthesis technology generate artificial, robotic-sounding speech that would never be mistaken for human speech. However, maximum naturalness is not always the goal of a speech synthesis system, and formant synthesis systems have advantages over concatenative systems. Formant-synthesized speech can be reliably intelligible, even at very high speeds, avoiding the acoustic glitches that commonly plague concatenative systems. High-speed synthesized speech is used by the visually impaired to quickly navigate computers using a [[screen reader]]. Formant synthesizers are usually smaller programs than concatenative systems because they do not have a database of speech samples. They can therefore be used in [[embedded system]]s, where [[data storage device|memory]] and [[microprocessor]] power are especially limited. Because formant-based systems have complete control of all aspects of the output speech, a wide variety of prosodies and [[Intonation (linguistics)|intonation]]s can be output, conveying not just questions and statements, but a variety of emotions and tones of voice.
Examples of non-real-time but highly accurate intonation control in formant synthesis include the work done in the late 1970s for the [[Texas Instruments]] toy [[Speak & Spell (game)|Speak & Spell]], and in the early 1980s [[Sega]] [[Video arcade|arcade]] machines. Creating proper intonation for these projects was painstaking, and the results have yet to be matched by real-time text-to-speech interfaces.
=== Articulatory synthesis ===
[[Articulatory synthesis]] refers to computational techniques for synthesizing speech based on models of the human [[vocal tract]] and the articulation processes occurring there. The first articulatory synthesizer regularly used for laboratory experiments was developed at [[Haskins Laboratories]] in the mid-1970s by [[Philip Rubin]], Tom Baer, and Paul Mermelstein. This synthesizer, known as ASY, was based on vocal tract models developed at [[Bell Laboratories]] in the 1960s and 1970s by Paul Mermelstein, Cecil Coker, and colleagues.
Until recently, articulatory synthesis models have not been incorporated into commercial speech synthesis systems. A notable exception is the [[NeXT]]-based system originally developed and marketed by Trillium Sound Research, a spin-off company of the [[University of Calgary]], where much of the original research was conducted. Following the demise of the various incarnations of NeXT (started by [[Steve Jobs]] in the late 1980s and merged with Apple Computer in 1997), the Trillium software was published under the [[GNU General Public License]], with work continuing as ''gnuspeech''. The system, first marketed in 1994, provides full articulatory-based text-to-speech conversion using a waveguide or transmission-line analog of the human oral and nasal tracts controlled by Carré's "distinctive region model".
=== HMM-based synthesis ===
HMM-based synthesis is a synthesis method based on [[hidden Markov model]]s. In this system, the [[frequency spectrum]] ([[vocal tract]]), [[fundamental frequency]] (vocal source), and duration ([[prosody]]) of speech are modeled simultaneously by HMMs. Speech [[waveforms]] are generated from HMMs themselves based on the [[maximum likelihood]] criterion.
=== Sinewave synthesis ===
[[Sinewave synthesis]] is a technique for synthesizing speech by replacing the [[formants]] (main bands of energy) with pure tone whistles.
== Challenges ==
=== Text normalization challenges ===
The process of normalizing text is rarely straightforward. Texts are full of [[Heteronym (linguistics)|heteronym]]s, [[number]]s, and [[abbreviation]]s that all require expansion into a phonetic representation. There are many spellings in English which are pronounced differently based on context. For example, "My latest project is to learn how to better project my voice" contains two pronunciations of "project".
Most text-to-speech (TTS) systems do not generate semantic representations of their input texts, as processes for doing so are not reliable, well understood, or computationally effective. As a result, various [[heuristic]] techniques are used to guess the proper way to disambiguate homographs, like examining neighboring words and using statistics about frequency of occurrence.
Deciding how to convert numbers is another problem that TTS systems have to address. It is a simple programming challenge to convert a number into words, like "1325" becoming "one thousand three hundred twenty-five." However, numbers occur in many different contexts; when a year or part of an address, "1325" should likely be read as "thirteen twenty-five", or, when part of a [[social security number]], as "one three two five". A TTS system can often infer how to expand a number based on surrounding words, numbers, and punctuation, and sometimes the system provides a way to specify the context if it is ambiguous.
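The three readings of "1325" mentioned above can be sketched in Python. This is a minimal illustration, not a description of any particular TTS front end: the function names and the `context` labels are hypothetical, the caller is assumed to supply the context, and only numbers from 0 to 9999 are handled.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def below_hundred(n):
    """Spell out an integer from 0 to 99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def as_cardinal(n):
    """Spell out an integer from 0 to 9999 as a quantity."""
    parts = []
    thousands, rest = divmod(n, 1000)
    hundreds, rest = divmod(rest, 100)
    if thousands:
        parts.append(f"{ONES[thousands]} thousand")
    if hundreds:
        parts.append(f"{ONES[hundreds]} hundred")
    if rest or not parts:
        parts.append(below_hundred(rest))
    return " ".join(parts)


def as_pairs(n):
    """Read a four-digit number as two two-digit groups (years, addresses).

    Special cases such as 2000 are deliberately not handled in this sketch.
    """
    high, low = divmod(n, 100)
    if low == 0:
        return f"{below_hundred(high)} hundred"
    if low < 10:
        return f"{below_hundred(high)} oh {ONES[low]}"
    return f"{below_hundred(high)} {below_hundred(low)}"


def as_digits(text):
    """Read a digit string one digit at a time (identification numbers)."""
    return " ".join(ONES[int(ch)] for ch in text)


def expand_number(token, context):
    """Choose a reading from a caller-supplied (hypothetical) context label."""
    if context in ("year", "address"):
        return as_pairs(int(token))
    if context == "digits":
        return as_digits(token)
    return as_cardinal(int(token))
```

In a real system the hard part is choosing the context label from the surrounding text, which this sketch leaves to the caller.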
Similarly, abbreviations can be ambiguous. For example, the abbreviation "in" for "inches" must be differentiated from the word "in", and the address "12 St John St." uses the same abbreviation for both "Saint" and "Street". TTS systems with intelligent front ends can make educated guesses about ambiguous abbreviations, while others provide the same result in all cases, resulting in nonsensical (and sometimes comical) outputs.
=== Text-to-phoneme challenges ===
Speech synthesis systems use two basic approaches to determine the pronunciation of a word based on its spelling, a process which is often called text-to-phoneme or grapheme-to-phoneme conversion ([[phoneme]] is the term used by linguists to describe distinctive sounds in a language). The simplest approach to text-to-phoneme conversion is the dictionary-based approach, where a large dictionary containing all the words of a language and their correct pronunciations is stored by the program. Determining the correct pronunciation of each word is a matter of looking up each word in the dictionary and replacing the spelling with the pronunciation specified in the dictionary. The other approach is rule-based, in which pronunciation rules are applied to words to determine their pronunciations based on their spellings. This is similar to the "sounding out", or [[synthetic phonics]], approach to learning reading.
Each approach has advantages and drawbacks. The dictionary-based approach is quick and accurate, but completely fails if it is given a word which is not in its dictionary. As dictionary size grows, so too do the memory space requirements of the synthesis system. On the other hand, the rule-based approach works on any input, but the complexity of the rules grows substantially as the system takes into account irregular spellings or pronunciations. (Consider that the word "of" is very common in English, yet is the only word in which the letter "f" is pronounced [v].) As a result, nearly all speech synthesis systems use a combination of these approaches.
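A combined approach might look roughly like the following Python sketch: look the word up in a dictionary first and fall back to letter-to-sound rules otherwise. The lexicon, the rules and the phoneme symbols are all invented for illustration and are far too small and crude for real use.

```python
# Toy pronunciation dictionary for irregular or common words.
# The phoneme symbols are informal and purely illustrative.
LEXICON = {
    "of": ["ah", "v"],  # irregular: the letter "f" is pronounced [v]
}

# Toy letter-to-sound rules; two-letter spellings are tried before single letters.
RULES = {
    "sh": "sh", "ch": "ch", "th": "th",
    "a": "ae", "b": "b", "c": "k", "d": "d", "e": "eh", "f": "f", "g": "g",
    "h": "hh", "i": "ih", "j": "jh", "k": "k", "l": "l", "m": "m", "n": "n",
    "o": "aa", "p": "p", "q": "k", "r": "r", "s": "s", "t": "t", "u": "ah",
    "v": "v", "w": "w", "x": "ks", "y": "y", "z": "z",
}


def letter_to_sound(word):
    """Rule-based fallback: scan left to right, preferring longer spellings."""
    phonemes = []
    i = 0
    while i < len(word):
        for size in (2, 1):
            chunk = word[i:i + size]
            if chunk in RULES:
                phonemes.append(RULES[chunk])
                i += len(chunk)
                break
        else:
            i += 1  # no rule matched: skip the character
    return phonemes


def pronounce(word):
    """Dictionary lookup first, rules only for words the dictionary lacks."""
    word = word.lower()
    if word in LEXICON:
        return LEXICON[word]
    return letter_to_sound(word)
```

Here `pronounce("of")` uses the dictionary entry, whereas the rules alone would produce an [f] sound for the same spelling.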
Some languages, like [[Spanish language|Spanish]], have a very regular writing system, and the prediction of the pronunciation of words based on their spellings is quite successful. Speech synthesis systems for such languages often use the rule-based method extensively, resorting to dictionaries only for those few words, like foreign names and borrowings, whose pronunciations are not obvious from their spellings. On the other hand, speech synthesis systems for languages like [[English language|English]], which have extremely irregular spelling systems, are more likely to rely on dictionaries, and to use rule-based methods only for unusual words, or words that aren't in their dictionaries.
=== Evaluation challenges ===
It is very difficult to evaluate speech synthesis systems consistently because there is no objective criterion, and different organizations usually use different speech data. The quality of a speech synthesis system depends heavily on the quality of the recordings. Therefore, evaluating speech synthesis systems is almost the same as evaluating recording skill.
Recently, researchers have started evaluating speech synthesis systems using a common speech dataset. This may help people compare differences between technologies rather than between recordings.
=== Prosodics and emotional content ===
A recent study in the journal ''Speech Communication'' by Amy Drahota and colleagues at the [[University of Portsmouth]], [[UK]], reported that listeners to voice recordings could determine, at better than chance levels, whether or not the speaker was smiling. It was suggested that identification of the vocal features which signal emotional content may be used to help make synthesized speech sound more natural.
== Dedicated hardware ==
*Votrax
**SC-01A (analog formant)
**SC-02 / SSI-263 / "Arctic 263"
*General Instruments SP0256-AL2 (CTS256A-AL2, MEA8000)
*Magnevation SpeakJet (www.speechchips.com TTS256)
*Savage Innovations SoundGin
*National Semiconductor DT1050 Digitalker (Mozer)
*Silicon Systems SSI 263 (analog formant)
*Texas Instruments
**TMS5110A (LPC)
**TMS5200
*Oki Semiconductor
**MSM5205
**MSM5218RS (ADPCM)
*Toshiba T6721A
*Philips PCF8200
== Computer operating systems or outlets with speech synthesis ==
=== Apple ===
The first speech system integrated into an [[operating system]] was [[Apple Computer]]'s [[PlainTalk#The original MacInTalk|MacInTalk]] in 1984. Since the 1980s, Macintosh computers have offered text-to-speech capabilities through the MacInTalk software. In the early 1990s Apple expanded its capabilities, offering system-wide text-to-speech support. With the introduction of faster PowerPC-based computers, Apple included higher-quality voice sampling. Apple also introduced [[speech recognition]] into its systems, which provided a fluid command set. More recently, Apple has added sample-based voices. Starting as a curiosity, the speech system of the Apple [[Macintosh (computer)|Macintosh]] has evolved into a cutting-edge, fully supported program, [[PlainTalk]], for people with vision problems. [[VoiceOver]] was included in Mac OS X Tiger and, more recently, Mac OS X Leopard. The voice shipping with Mac OS X 10.5 ("Leopard") is called "Alex" and features realistic-sounding breaths between sentences, as well as improved clarity at high read rates.
=== AmigaOS ===
The second operating system with advanced speech synthesis capabilities was [[AmigaOS]], introduced in 1985. The voice synthesis was licensed by [[Commodore International]] from a third-party software house (Don't Ask Software, now Softvoice, Inc.) and it featured a complete system of voice emulation, with both male and female voices and "stress" indicator markers, made possible by advanced features of the [[Amiga]] hardware audio [[chipset]]. It was divided into a narrator device and a translator library. Amiga [[AmigaOS#Speech synthesis|Speak Handler]] featured a text-to-speech translator. AmigaOS considered speech synthesis a virtual hardware device, so the user could even redirect console output to it. Some Amiga programs, such as word processors, made extensive use of the speech system.
=== Microsoft Windows ===
Modern [[Microsoft Windows|Windows]] systems use [[Speech Application Programming Interface#SAPI 1-4 API family|SAPI4]]- and [[Speech Application Programming Interface#SAPI 5 API family|SAPI5]]-based speech systems that include a [[speech recognition]] engine (SRE). SAPI 4.0 was available on Microsoft-based operating systems as a third-party add-on for systems like [[Windows 95]] and [[Windows 98]]. [[Windows 2000]] added a speech synthesis program called [[Microsoft Narrator|Narrator]], directly available to users. All Windows-compatible programs could make use of speech synthesis features, available through menus once installed on the system. [[Microsoft Speech Server]] is a complete package for voice synthesis and recognition, for commercial applications such as [[call centers]].
=== Internet ===
Currently, there are a number of [[Application software|applications]], [[plugin]]s and [[gadget]]s that can read messages directly from an [[e-mail client]] and web pages from a [[web browser]]. Some specialized [[Computer software|software]] can narrate [[RSS|RSS-feeds]]. On one hand, online RSS-narrators simplify information delivery by allowing users to listen to their favourite news sources and to convert them to [[podcast]]s. On the other hand, on-line RSS-readers are available on almost any [[Personal computer|PC]] connected to the Internet. Users can download generated audio files to portable devices, e.g. with the help of a [[podcast]] receiver, and listen to them while walking, jogging or commuting to work.
A growing field in internet-based TTS technology is web-based assistive technology, e.g. Talklets. This web-based approach to a traditionally locally installed form of software application gives many of those who require software for accessibility reasons the ability to access web content from public machines or machines belonging to others. While responsiveness is not as immediate as that of locally installed applications, the "access anywhere" nature of this approach is its key benefit.
=== Others ===
* Some models of Texas Instruments home computers produced in 1979 and 1981 ([[TI-99/4A|Texas Instruments TI-99/4 and TI-99/4A]]) were capable of text-to-phoneme synthesis or reciting complete words and phrases (text-to-dictionary), using a very popular Speech Synthesizer peripheral. TI used a proprietary [[codec]] to embed complete spoken phrases into applications, primarily video games.
* Various speech synthesis systems run on free and open source software systems, including [[Linux|GNU/Linux]]. They include [[open-source]] programs such as the [[Festival Speech Synthesis System]], which uses diphone-based synthesis (and can use a limited number of [[MBROLA]] voices), and gnuspeech, from the [[Free Software Foundation]], which uses articulatory synthesis. Other commercial vendor software also runs on GNU/Linux.
* Several commercial companies are also developing speech synthesis systems (this list is reporting them just for the sake of information, not endorsing any specific product): [http://www.acapela-group.com Acapela Group], [[AT&T]], [[Cepstral]], [[DECtalk]], [[IBM ViaVoice]], [[IVONA|IVONA TTS]], [http://www.loquendo.com Loquendo TTS], [http://www.neospeech.com NeoSpeech TTS], [[Nuance Communications]], Rhetorical Systems, [http://www.svox.com SVOX] and [http://www.yakitome.com YAKiToMe!].
* Companies which developed speech synthesis systems but which are no longer in this business include BeST Speech (bought by L&H), [[Lernout & Hauspie]] (bankrupt), [[SpeechWorks]] (bought by Nuance)
== Speech synthesis markup languages ==
A number of [[markup language]]s have been established for the rendition of text as speech in an [[XML]]-compliant format. The most recent is [[Speech Synthesis Markup Language]] (SSML), which became a [[W3C recommendation]] in 2004. Older speech synthesis markup languages include Java Speech Markup Language ([[JSML]]) and [[SABLE]]. Although each of these was proposed as a standard, none of them has been widely adopted.
Speech synthesis markup languages are distinguished from dialogue markup languages. [[VoiceXML]], for example, includes tags related to speech recognition, dialogue management and touchtone dialing, in addition to text-to-speech markup.
==Applications==
===Accessibility===
Speech synthesis has long been a vital [[assistive technology]] tool and its application in this area is significant and widespread. It allows environmental barriers to be removed for people with a wide range of disabilities. The longest application has been in the use of [[screenreaders]] for people with [[visual impairment]], but text-to-speech systems are now commonly used by people with [[dyslexia]] and other reading difficulties as well as by pre-literate youngsters. They are also frequently employed to aid those with severe [[speech impairment]] usually through a dedicated [[voice output communication aid]].
===News service===
Sites such as [[Ananova]] have used speech synthesis to convert written news to audio content, which can be used for mobile applications.
===Entertainment===
Speech synthesis techniques are also used in entertainment productions such as games and anime. In 2007, Animo Limited announced the development of a software application package based on its speech synthesis software FineSpeech, explicitly geared towards customers in the entertainment industries, able to generate narration and lines of dialogue according to user specifications.
Software such as [[Vocaloid]] can generate singing voices via lyrics and melody. This is also the aim of the Singing Computer project (which uses the [[GNU General Public License|GPL]] software [[GNU LilyPond|Lilypond]] and [[Festival Speech Synthesis System|Festival]]) to help blind people check their lyric input.
Statistical classification
'''Statistical classification''' is a procedure in which individual items are placed into groups based on quantitative information on one or more characteristics inherent in the items (referred to as traits, variables, characters, etc) and based on a [[training set]] of previously labeled items.
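Formally, the problem can be stated as follows: given training data <math>\{(\vec{x}_1, y_1), \dots, (\vec{x}_n, y_n)\}</math> produce a classifier <math>h: \mathcal{X} \rightarrow \mathcal{Y}</math> which maps an object <math>\vec{x} \in \mathcal{X}</math> to its classification label <math>y \in \mathcal{Y}</math>. For example, if the problem is filtering spam, then <math>\vec{x}_i</math> is some representation of an email and <math>y</math> is either "Spam" or "Non-Spam".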
Statistical classification algorithms are typically used in [[pattern recognition]] systems.
'''Note:''' in [[community ecology]], the term "classification" is synonymous with what is commonly known (in [[machine learning]]) as [[data clustering|clustering]]. See that article for more information about purely [[unsupervised learning|unsupervised]] techniques.
* The second problem is to consider classification as an [[estimation]] problem, where the goal is to estimate a function of the form
:<math>P(\mathrm{class} \mid \vec{x}) = f\left(\vec{x}; \vec{\theta}\right)</math>
where the feature vector input is <math>\vec{x}</math>, and the function f is typically parameterized by some parameters <math>\vec{\theta}</math>. In the [[Bayesian statistics|Bayesian]] approach to this problem, instead of choosing a single parameter vector <math>\vec{\theta}</math>, the result is integrated over all possible thetas, with the thetas weighted by how likely they are given the training data D:
:<math>P(\mathrm{class} \mid \vec{x}) = \int f\left(\vec{x}; \vec{\theta}\right) P(\vec{\theta} \mid D) \, d\vec{\theta}</math>
* The third problem is related to the second, but the problem is to estimate the [[conditional probability|class-conditional probabilities]] <math>P(\vec{x} \mid \mathrm{class})</math> and then use [[Bayes' rule]] to produce the class probability as in the second problem.
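The third formulation can be illustrated with a short Python sketch. It assumes, purely for illustration, a single numeric feature and a Gaussian class-conditional density for each class; the function names are hypothetical, and nothing here is specific to any particular classifier from the literature.

```python
import math
from collections import defaultdict


def fit(samples):
    """Estimate a prior and a 1-D Gaussian per class from (x, label) pairs."""
    by_class = defaultdict(list)
    for x, label in samples:
        by_class[label].append(x)
    model = {}
    for label, xs in by_class.items():
        mean = sum(xs) / len(xs)
        var = sum((x - mean) ** 2 for x in xs) / len(xs)
        model[label] = {
            "prior": len(xs) / len(samples),
            "mean": mean,
            "var": max(var, 1e-9),  # guard against zero variance
        }
    return model


def gaussian_pdf(x, mean, var):
    """Assumed class-conditional density P(x | class)."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def posterior(model, x):
    """Bayes' rule: P(class | x) is proportional to P(x | class) * P(class)."""
    joint = {label: gaussian_pdf(x, p["mean"], p["var"]) * p["prior"]
             for label, p in model.items()}
    total = sum(joint.values())  # may underflow to 0 for extreme x in this sketch
    return {label: value / total for label, value in joint.items()}


def classify(model, x):
    """Pick the class with the highest posterior probability."""
    probs = posterior(model, x)
    return max(probs, key=probs.get)
```

The normalizing sum in `posterior` plays the role of the denominator in Bayes' rule, so the returned values add up to one across classes.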
Examples of classification algorithms include:
* [[Linear classifier]]s
** [[Fisher's linear discriminant]]
** [[Logistic regression]]
** [[Naive Bayes classifier]]
** [[Perceptron]]
** [[Support vector machine]]s
* [[Quadratic classifier]]s
* [[Nearest_neighbor_(pattern_recognition)|k-nearest neighbor]]
* [[Boosting]]
* [[Decision tree]]s
** [[Random forest]]s
* [[Artificial neural networks|Neural network]]s
* [[Bayesian network]]s
* [[Hidden Markov model]]s
An intriguing problem in pattern recognition yet to be solved is the relationship between the problem to be solved (data to be classified) and the performance of various pattern recognition algorithms (classifiers). Van der Walt and Barnard (see reference section) investigated very specific artificial data sets to determine conditions under which certain classifiers perform better and worse than others.
Classifier performance depends greatly on the characteristics of the data to be classified. There is no single classifier that works best on all given problems (a phenomenon that may be explained by the [[No free lunch in search and optimization|No-free-lunch theorem]]). Various empirical tests have been performed to compare classifier performance and to find the characteristics of data that determine classifier performance. Determining a suitable classifier for a given problem is however still more an art than a science.
The most widely used classifiers are the [[Neural Network]] (Multi-layer Perceptron), [[Support Vector Machines]], [[KNN|k-Nearest Neighbours]], Gaussian Mixture Model, Gaussian, [[Naive Bayes]], [[Decision Tree]] and [[Radial Basis Function|RBF]] classifiers.
== Evaluation ==
The measures [[Precision and Recall]] are popular metrics used to evaluate the quality of a classification system. More recently, [[Receiver Operating Characteristic]] (ROC) curves have been used to evaluate the tradeoff between true- and false-positive rates of classification algorithms.
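As a minimal sketch, precision and recall for one class of interest can be computed from paired lists of true and predicted labels. The function name and the label values are illustrative assumptions.

```python
def precision_recall(true_labels, predicted_labels, positive):
    """Return (precision, recall) for the class named by `positive`."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Precision is the fraction of items labelled positive that really are positive; recall is the fraction of truly positive items that were found. Returning 0.0 when a denominator is zero is a convention chosen for this sketch.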
==Application domains==
* [[Computer vision]]
** [[Medical Imaging]] and Medical Image Analysis
** [[Optical character recognition]]
* [[Geostatistics]]
* [[Speech recognition]]
* [[Handwriting recognition]]
* [[Biometric]] identification
* [[Natural language processing]]
* [[Document classification]]
* Internet [[search engines]]
* [[Credit scoring]]
Statistical machine translation
'''Statistical machine translation''' ('''SMT''') is a [[machine translation]] paradigm where translations are generated on the basis of statistical models whose parameters are derived from the analysis of bilingual [[text corpora]]. The statistical approach contrasts with the rule-based approaches to [[machine translation]] as well as with [[example-based machine translation]].
The first ideas of statistical machine translation were introduced by [[Warren Weaver]] in 1949, including the ideas of applying [[Claude Shannon]]'s [[information theory]]. Statistical machine translation was re-introduced in 1991 by researchers at [[IBM]]'s [[Thomas J. Watson Research Center]] and has contributed to the significant resurgence in interest in machine translation in recent years. As of 2006, it is by far the most widely-studied machine translation paradigm.
==Benefits==
The benefits of statistical machine translation over traditional paradigms that are most often cited are the following:
* '''Better use of resources'''
**There is a great deal of natural language in machine-readable format.
**Generally, SMT systems are not tailored to any specific pair of languages.
**Rule-based translation systems require the manual development of linguistic rules, which can be costly, and which often do not generalize to other languages.
* '''More natural translations'''
The ideas behind statistical machine translation come out of [[information theory]]. Essentially, the document is translated on the [[probability]] <math>p(e|f)</math> that a string <math>e</math> in native language (for example, English) is the translation of a string <math>f</math> in foreign language (for example, French). Generally, these probabilities are estimated using techniques of [[parameter estimation]].
The [[Bayes Theorem]] is applied to <math>p(e|f)</math>, the probability that the foreign string produces the native string to get <math>p(e|f) \propto p(f|e)\,p(e)</math>, where the [[translation model]] <math>p(f|e)</math> is the probability that the native string is the translation of the foreign string, and the [[language model]] <math>p(e)</math> is the probability of seeing that native string.
Mathematically speaking, finding the best translation <math>\tilde{e}</math> is done by picking up the one that gives the highest probability:
:<math>\tilde{e} = \arg\max_{e} p(e|f) = \arg\max_{e} p(f|e)\,p(e)</math>.
For a rigorous implementation of this one would have to perform an exhaustive search by going through all strings in the native language. Performing the search efficiently is the work of a [[machine translation decoder]], which uses the foreign string, heuristics and other methods to limit the search space while keeping acceptable quality. This trade-off between quality and time usage can also be found in [[speech recognition]].
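A toy version of this search can be sketched in Python: given a foreign string, score each candidate native string by the product of a translation-model probability and a language-model probability, and keep the best. The function name and the dictionary-based models are hypothetical, and a real decoder would not enumerate an explicit candidate list.

```python
def best_translation(foreign, candidates, translation_model, language_model):
    """Return the candidate e that maximizes p(f | e) * p(e).

    translation_model maps (foreign, native) pairs to p(f | e);
    language_model maps native strings to p(e). Missing entries count as 0.
    """
    def score(native):
        p_f_given_e = translation_model.get((foreign, native), 0.0)
        p_e = language_model.get(native, 0.0)
        return p_f_given_e * p_e

    return max(candidates, key=score)


# Invented probabilities, for illustration only.
translation_model = {
    ("la maison", "the house"): 0.4,
    ("la maison", "house the"): 0.4,
    ("la maison", "the home"): 0.2,
}
language_model = {"the house": 0.05, "house the": 0.0001, "the home": 0.03}
candidates = ["the house", "house the", "the home"]
print(best_translation("la maison", candidates, translation_model, language_model))
```

In this made-up example the translation model cannot distinguish "the house" from "house the"; the language model is what penalizes the ungrammatical word order.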
As the translation systems are not able to store all native strings and their translations, a document is typically translated sentence by sentence, but even this is not enough. Language models are typically approximated by smoothed ''n''-gram models, and similar approaches have been applied to translation models, but there is additional complexity due to different sentence lengths and word orders in the languages.
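As one simple instance of a smoothed ''n''-gram model, the following Python sketch trains a bigram language model with add-one (Laplace) smoothing. The function names and the tiny training corpus are illustrative assumptions; practical systems use larger ''n'', far more data and more refined smoothing methods.

```python
from collections import Counter


def train_bigram(sentences):
    """Count bigrams and their left contexts over whitespace-tokenized sentences."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(tokens)
        contexts.update(tokens[:-1])            # every token that precedes another
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams, len(vocab)


def bigram_prob(model, prev, word):
    """Add-one smoothed estimate of p(word | prev); never zero."""
    contexts, bigrams, vocab_size = model
    return (bigrams[(prev, word)] + 1) / (contexts[prev] + vocab_size)


def sentence_prob(model, sentence):
    """Probability of a sentence as a product of smoothed bigram probabilities."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= bigram_prob(model, prev, word)
    return prob
```

Because of the smoothing, a word sequence that never occurred in the training data still receives a small non-zero probability instead of ruling out a translation entirely.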
The statistical translation models were initially [[word]] based (Models 1-5 from [[IBM]]), but significant advances were made with the introduction of [[phrase]] based models. Recent work has incorporated [[syntax]] or quasi-syntactic structures.
==Word-based translation==
In word-based translation, translated elements are words. Typically, the number of words in translated sentences differs because of compound words, morphology and idioms. The ratio of the lengths of sequences of translated words is called fertility, which tells how many foreign words each native word produces. Simple word-based translation is not able to translate language pairs with fertility rates different from one. To handle, for instance, high fertility rates, a word-based translation system can map a single word to multiple words, but not vice versa. For instance, if we are translating from French to English, each word in English could produce zero or more French words. But there is no way to group two English words so that they produce a single French word.
An example of a word-based translation system is the freely available [[GIZA++]] package ([[GPL]]ed), which includes [[IBM]] models.
==Phrase-based translation==
Phrase-based translation tries to reduce the restrictions of word-based translation by translating sequences of words to sequences of words, where the lengths can differ. The sequences of words are called, for instance, blocks or phrases; typically they are not linguistic [[phrase]]s but phrases found using statistical methods from the corpus. Restricting the phrases to linguistic phrases has been shown to decrease translation quality.
==Syntax-based translation==
==Challenges with statistical machine translation==
Problems that statistical machine translation has to deal with include:
=== Compound words ===
=== Idioms ===
=== Morphology ===
=== Different word orders ===
Word order differs between languages. Some classification can be done by naming the typical order of subject (S), verb (V) and object (O) in a sentence, and one can talk, for instance, of SVO or VSO languages. There are also additional differences in word order, for instance, in where modifiers for nouns are located.
In [[Speech Recognition]], the speech signal and the corresponding textual representation can be mapped to each other in blocks in order. This is not always the case with the same text in two languages. For SMT, the translation model is only able to translate small sequences of words, and word order has to be taken into account somehow. The typical solution has been re-ordering models, where a distribution of location changes for each item of translation is approximated from aligned bi-text. Different location changes can be ranked with the help of the language model, and the best can be selected.
=== Syntax ===
=== Out of vocabulary (OOV) words ===
SMT systems store different word forms as separate symbols without any relation to each other, and word forms or phrases that were not in the training data cannot be translated. The main reasons for out-of-vocabulary words are the limitation of training data, domain changes and morphology.
Statistics
'''Statistics''' is a [[Mathematics|mathematical science]] pertaining to the collection, analysis, interpretation or explanation, and presentation of [[data]]. It is applicable to a wide variety of [[academic discipline]]s, from the [[Natural science|natural]] and [[social science]]s to the [[humanities]], government and business.
Statistical methods can be used to summarize or describe a collection of data; this is called '''[[descriptive statistics]]'''. In addition, patterns in the data may be [[mathematical model|modeled]] in a way that accounts for [[random]]ness and uncertainty in the observations, and then used to draw inferences about the process or population being studied; this is called '''[[inferential statistics]]'''. Both descriptive and inferential statistics comprise '''applied statistics'''. There is also a discipline called '''[[mathematical statistics]]''', which is concerned with the theoretical basis of the subject.
The word '''''statistics''''' is also the plural of '''''[[statistic]]''''' (singular), which refers to the result of applying a statistical algorithm to a set of data, as in [[economic statistics]], [[crime statistics]], etc.
==History==
''"Five men, [[Hermann Conring|Conring]],[[Gottfried Achenwall| Achenwall]], [[Johann Peter Süssmilch|Süssmilch]], [[John Graunt|Graunt]] and [[William Petty|Petty]] have been honored by different writers as the founder of statistics."'' claims one source (Willcox, Walter (1938) ''The Founder of Statistics''. Review of the [[International Statistical Institute]] 5(4):321-328.)
Some scholars pinpoint the origin of statistics to 1662, with the publication of "[[Observations on the Bills of Mortality]]" by John Graunt. Early applications of statistical thinking revolved around the needs of states to base policy on demographic and economic data. The scope of the discipline of statistics broadened in the early 19th century to include the collection and analysis of data in general. Today, statistics is widely employed in government, business, and the natural and social sciences.
Because of its empirical roots and its applications, statistics is generally considered not to be a subfield of pure mathematics, but rather a distinct branch of applied mathematics. Its mathematical foundations were laid in the 17th century with the development of [[probability theory]] by [[Pascal]] and [[Fermat]]. Probability theory arose from the study of games of chance. The [[method of least squares]] was first described by [[Carl Friedrich Gauss]] around 1794. The use of modern [[computer]]s has expedited large-scale statistical computation, and has also made possible new methods that are impractical to perform manually.
==Overview==
In applying statistics to a scientific, industrial, or societal problem, one begins with a process or [[statistical population|population]] to be studied. This might be a population of people in a country, of crystal grains in a rock, or of goods manufactured by a particular factory during a given period. It may instead be a process observed at various times; data collected about this kind of "population" constitute what is called a [[time series]].
For practical reasons, rather than compiling data about an entire population, one usually studies a chosen subset of the population, called a [[sampling (statistics)|sample]]. Data are collected about the sample in an observational or [[experiment]]al setting. The data are then subjected to statistical analysis, which serves two related purposes: description and inference.
*[[Descriptive statistics]] can be used to summarize the data, either numerically or graphically, to describe the sample. Basic examples of numerical descriptors include the [[mean]] and [[standard deviation]]. Graphical summarizations include various kinds of charts and graphs.
*[[Inferential statistics]] is used to model patterns in the data, accounting for randomness and drawing inferences about the larger population. These inferences may take the form of answers to yes/no questions ([[hypothesis testing]]), estimates of numerical characteristics ([[estimation]]), descriptions of association ([[correlation]]), or modeling of relationships ([[regression analysis|regression]]). Other [[mathematical model|modeling]] techniques include [[ANOVA]], [[time series]], and [[data mining]].
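The numerical descriptors mentioned above can be computed with Python's standard `statistics` module. The sample values below are invented for illustration, and `describe` is a hypothetical helper name.

```python
import statistics


def describe(sample):
    """Return the mean and the sample standard deviation of a list of numbers."""
    return {
        "mean": statistics.mean(sample),
        "stdev": statistics.stdev(sample),  # sample version: divides by n - 1
    }


# Invented measurements, for illustration only.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
summary = describe(sample)
print(summary["mean"], round(summary["stdev"], 3))
```

`statistics.stdev` treats the data as a sample drawn from a larger population; `statistics.pstdev` would instead treat the data as the entire population.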
The concept of correlation is particularly noteworthy. Statistical analysis of a [[data set]] may reveal that two variables (that is, two properties of the population under consideration) tend to vary together, as if they are connected. For example, a study of annual income and age of death among people might find that poor people tend to have shorter lives than affluent people. The two variables are said to be correlated (which is a positive correlation in this case). However, one cannot immediately infer the existence of a causal relationship between the two variables. (See [[Correlation does not imply causation]].) The correlated phenomena could be caused by a third, previously unconsidered phenomenon, called a [[lurking variable]] or [[confounding variable]].
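The degree to which two variables vary together is often summarized by the Pearson correlation coefficient. The sketch below computes it directly from its definition; the function name <code>pearson_r</code> and the income and age figures are assumptions made for illustration, not real data.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (proportional to the covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (proportional to the variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: annual income (thousands) and age at death (years).
income = [12, 20, 35, 50, 80]
age_at_death = [66, 70, 74, 79, 83]

r = pearson_r(income, age_at_death)
print(f"r = {r:.3f}")  # a value near +1 indicates a strong positive correlation
```

A coefficient close to +1 or −1 shows only that the variables move together in the data; it says nothing about whether one causes the other or whether a lurking variable drives both.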
If the sample is representative of the population, then inferences and conclusions made from the sample can be extended to the population as a whole. A major problem lies in determining the extent to which the chosen sample is representative. Statistics offers methods to estimate and correct for randomness in the sample and in the data collection procedure, as well as methods for designing robust experiments in the first place. (See [[experimental design]].)
The fundamental mathematical concept employed in understanding such randomness is [[probability]]. [[Mathematical statistics]] (also called [[statistical theory]]) is the branch of [[applied mathematics]] that uses probability theory and [[mathematical analysis|analysis]] to examine the theoretical basis of statistics.
The use of any statistical method is valid only when the system or population under consideration satisfies the basic mathematical assumptions of the method. [[Misuse of statistics]] can produce subtle but serious errors in description and interpretation: subtle in the sense that even experienced professionals sometimes make such errors, and serious in the sense that they may affect, for instance, social policy, medical practice and the reliability of structures such as bridges.
Even when statistics is correctly applied, the results can be difficult for the non-expert to interpret. For example, the [[statistical significance]] of a trend in the data, which measures the extent to which the trend could be caused by random variation in the sample, may not agree with one's intuitive sense of its significance. The set of basic statistical skills (and skepticism) needed by people to deal with information in their everyday lives is referred to as [[statistical literacy]].
==Statistical methods==
===Experimental and observational studies===
A common goal for a statistical research project is to investigate [[causality]], and in particular to draw a conclusion on the effect of changes in the values of predictors or [[independent variable]]s on response or [[dependent variable]]s. There are two major types of causal statistical studies, experimental studies and observational studies. In both types of studies, the effect of differences in an independent variable (or variables) on the behavior of the dependent variable is observed. The difference between the two types lies in how the study is actually conducted. Each can be very effective.
An experimental study involves taking measurements of the system under study, manipulating the system, and then taking additional measurements using the same procedure to determine if the manipulation has modified the values of the measurements. In contrast, an observational study does not involve experimental manipulation. Instead, data are gathered and correlations between predictors and response are investigated.
An example of an experimental study is the famous [[Hawthorne studies]], which tested the effect of changes to the working environment at the Hawthorne plant of the Western Electric Company. The researchers were interested in determining whether increased illumination would increase the productivity of the [[assembly line]] workers. The researchers first measured the productivity in the plant, then modified the illumination in an area of the plant and checked whether the changes in illumination affected the productivity. It turned out that the productivity indeed improved (under the experimental conditions). (See [[Hawthorne effect]].) However, the study is heavily criticized today for errors in experimental procedures, specifically for the lack of a [[control group]] and of [[double-blind|blinding]].
An example of an observational study is a study which explores the correlation between smoking and lung cancer. This type of study typically uses a survey to collect observations about the area of interest and then performs statistical analysis. In this case, the researchers would collect observations of both smokers and non-smokers, perhaps through a [[case-control study]], and then look for the number of cases of lung cancer in each group.
The basic steps of an experiment are:
# Planning the research, including determining information sources, research subject selection, and [[ethics|ethical]] considerations for the proposed research and method.
# [[Design of experiments]], concentrating on the system model and the interaction of independent and dependent variables.
# [[summary statistics|Summarizing a collection of observations]] to feature their commonality by suppressing details. ([[Descriptive statistics]])
# Reaching consensus about what [[statistical inference|the observations tell]] about the world being observed. ([[Statistical inference]])
# Documenting / presenting the results of the study.
===Levels of measurement===
:''See: [[Levels of measurement|Stanley Stevens' "Scales of measurement" (1946): nominal, ordinal, interval, ratio]]''
There are four types of measurements or [[level of measurement|levels of measurement]] or measurement scales used in statistics: nominal, ordinal, interval, and ratio. They have different degrees of usefulness in statistical [[research]]. Ratio measurements have both a meaningful zero value and meaningful distances between different measurements; they provide the greatest flexibility in the statistical methods that can be used for analyzing the data. Interval measurements have meaningful distances between measurements, but no meaningful zero value (as is the case with IQ measurements or with temperature measurements in [[Fahrenheit]]). Ordinal measurements have imprecise differences between consecutive values, but have a meaningful order to those values. Nominal measurements have no meaningful rank order among values.
Since variables conforming only to nominal or ordinal measurements cannot be reasonably measured numerically, they are sometimes grouped together as categorical variables, whereas ratio and interval measurements are grouped together as quantitative or [[continuous variables]] due to their numerical nature.
===Statistical techniques===
Some well known statistical [[Statistical hypothesis testing|test]]s and [[procedure]]s for [[research]] [[observation]]s are:
* [[Student's t-test]]
* [[chi-square test]]
* [[Analysis of variance]] (ANOVA)
* [[Mann-Whitney U]]
* [[Regression analysis]]
* [[Factor Analysis]]
* [[Correlation]]
* [[Pearson product-moment correlation coefficient]]
* [[Spearman's rank correlation coefficient]]
* [[Time Series Analysis]]
==Specialized disciplines==
Some fields of inquiry use applied statistics so extensively that they have [[specialized terminology]]. These disciplines include:
* [[Actuarial science]]
* [[Applied information economics]]
* [[Biostatistics]]
* [[Bootstrapping (statistics)|Bootstrap]] & [[Resampling (statistics)|Jackknife Resampling]]
* [[Business statistics]]
* [[Data analysis]]
* [[Data mining]] (applying statistics and [[pattern recognition]] to discover knowledge from data)
* [[Demography]]
* [[Economic statistics]] (Econometrics)
* [[Energy statistics]]
* [[Engineering statistics]]
* [[Environmental Statistics]]
* [[Epidemiology]]
* [[Geography]] and [[Geographic Information Systems]], more specifically in [[Spatial analysis]]
* [[Image processing]]
* [[Multivariate statistics|Multivariate Analysis]]
* [[Psychological statistics]]
* [[Quality]]
* [[Social statistics]]
* [[Statistical literacy]]
* [[Statistical modeling]]
* [[Statistical survey]]s
* Process analysis and [[chemometrics]] (for analysis of data from [[analytical chemistry]] and [[chemical engineering]])
* [[Structured data analysis (statistics)]]
* [[Survival analysis]]
* [[Reliability engineering]]
* Statistics in various sports, particularly [[Baseball statistics|baseball]] and [[Cricket statistics|cricket]]
Statistics is also a key tool in business and manufacturing. It is used to understand the variability of measurement systems, to control processes (as in [[statistical process control]] or SPC), to summarize data, and to make data-driven decisions. In these roles it is a key tool, and perhaps the only reliable one.
==Statistical computing==
The rapid and sustained increases in computing power starting from the second half of the 20th century have had a substantial impact on the practice of statistical science. Early statistical models were almost always from the class of [[linear model]]s, but powerful computers, coupled with suitable numerical [[algorithms]], caused an increased interest in [[nonlinear regression|nonlinear models]] (especially [[neural networks]] and [[decision tree]]s) as well as the creation of new types, such as [[generalized linear model|generalised linear model]]s and [[multilevel model]]s.
Increased computing power has also led to the growing popularity of computationally-intensive methods based on [[resampling (statistics)|resampling]], such as permutation tests and the [[bootstrapping (statistics)|bootstrap]], while techniques such as [[Gibbs sampling]] have made Bayesian methods more feasible. The computer revolution has implications for the future of statistics, with a new emphasis on "experimental" and "empirical" statistics. A large number of both general- and special-purpose [[List of statistical packages|statistical software]] packages are now available.
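To give a sense of why resampling methods depend on computing power, the following sketch applies the bootstrap to estimate a 95% percentile interval for a sample mean. The sample values, the number of resamples, the fixed seed and the helper names <code>bootstrap_means</code> and <code>percentile_interval</code> are all assumptions made for illustration; this is one simple variant of the bootstrap, not a definitive implementation.

```python
import random
import statistics

def bootstrap_means(sample, n_resamples=2000, seed=0):
    """Means of n_resamples resamples drawn with replacement from sample."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(sample)
    return [statistics.fmean(rng.choices(sample, k=n)) for _ in range(n_resamples)]

def percentile_interval(values, level=0.95):
    """Simple percentile interval covering the central `level` share of values."""
    ordered = sorted(values)
    alpha = (1 - level) / 2
    lower = ordered[int(alpha * len(ordered))]
    upper = ordered[int((1 - alpha) * len(ordered)) - 1]
    return lower, upper

# Hypothetical sample (illustrative values only).
sample = [4.1, 5.3, 3.8, 6.0, 5.5, 4.7, 5.1, 4.4, 6.2, 5.0]

means = bootstrap_means(sample)
lo, hi = percentile_interval(means)
print(f"sample mean = {statistics.fmean(sample):.2f}")
print(f"95% bootstrap percentile interval = ({lo:.2f}, {hi:.2f})")
```

Each resample requires drawing and averaging as many values as the original sample contains, and thousands of resamples are typical, which is impractical by hand but trivial for a computer.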
== Misuse ==
There is a general perception that statistical knowledge is all-too-frequently intentionally [[Misuse of statistics|misused]] by finding ways to interpret only the data that are favorable to the presenter. A famous saying attributed to [[Benjamin Disraeli]] is, "[[Lies, damned lies, and statistics|There are three kinds of lies: lies, damned lies, and statistics]]"; and Harvard President [[Lawrence Lowell]] wrote in 1909 that statistics, ''"like veal pies, are good if you know the person that made them, and are sure of the ingredients"''.
If various studies appear to contradict one another, then the public may come to distrust such studies. For example, one study may suggest that a given diet or activity raises [[blood pressure]], while another may suggest that it lowers blood pressure. The discrepancy can arise from subtle variations in experimental design, such as differences in the patient groups or research protocols, that are not easily understood by the non-expert. (Media reports sometimes omit this vital contextual information entirely.)
By choosing (or rejecting, or modifying) a certain sample, results can be manipulated. Such manipulations need not be malicious or devious; they can arise from unintentional biases of the researcher. The graphs used to summarize data can also be misleading.
Deeper criticisms come from the fact that the hypothesis testing approach, widely used and in many cases required by law or regulation, forces one hypothesis (the [[null hypothesis]]) to be "favored", and can also seem to exaggerate the importance of minor differences in large studies. A difference that is highly statistically significant can still be of no practical significance. (See [[Hypothesis test#Criticism|criticism of hypothesis testing]] and [[Null hypothesis#Controversy|controversy over the null hypothesis]].)
One response is to give greater emphasis to the [[p-value|''p''-value]] rather than simply reporting whether a hypothesis is rejected at the given level of significance. The ''p''-value, however, does not indicate the size of the effect. Another increasingly common approach is to report [[confidence interval]]s. Although these are produced from the same calculations as those of hypothesis tests or ''p''-values, they describe both the size of the effect and the uncertainty surrounding it.
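The contrast between statistical and practical significance can be made concrete with a small calculation. The sketch below uses a large-sample normal approximation (a two-sample ''z''-test) on hypothetical summary statistics; the function name <code>two_sample_z</code> and all of the numbers are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist

def two_sample_z(mean_a, mean_b, sd_a, sd_b, n_a, n_b, level=0.95):
    """Difference in means, two-sided p-value and confidence interval
    under a large-sample normal approximation."""
    diff = mean_b - mean_a
    # Standard error of the difference between two independent means.
    se = math.sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    z = diff / se
    std_normal = NormalDist()
    p_value = 2 * (1 - std_normal.cdf(abs(z)))
    z_crit = std_normal.inv_cdf(1 - (1 - level) / 2)
    return diff, p_value, (diff - z_crit * se, diff + z_crit * se)

# Hypothetical study: two groups of one million subjects each,
# with means 100.0 and 100.1 and a common standard deviation of 15.
diff, p_value, ci = two_sample_z(100.0, 100.1, 15.0, 15.0, 1_000_000, 1_000_000)

print(f"difference = {diff:.3f}")
print(f"p-value = {p_value:.2e}")
print(f"95% confidence interval = ({ci[0]:.3f}, {ci[1]:.3f})")
```

With these figures the ''p''-value is far below conventional significance levels, yet the confidence interval shows that the difference is only a small fraction of one unit on a scale whose standard deviation is 15, which may be of no practical importance.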
Syntax
In [[linguistics]], '''syntax''' (from [[Ancient Greek]] {{lang|grc|συν-}} ''syn-'', "together", and {{lang|grc|τάξις}} ''táxis'', "arrangement") is the study of the principles and rules for constructing [[sentence]]s in [[natural language]]s. In addition to referring to the discipline, the term ''syntax'' is also used to refer directly to the rules and principles that govern the sentence structure of any individual language, as in "the [[Irish syntax|syntax of Modern Irish]]". Modern research in syntax attempts to [[descriptive linguistics|describe languages]] in terms of such rules. Many professionals in this discipline attempt to find [[Universal Grammar|general rules]] that apply to all natural languages. The term ''syntax'' is also sometimes used to refer to the rules governing the behavior of mathematical systems, such as [[logic]], artificial formal languages, and computer programming languages.
== Early history ==
Works on grammar were being written long before modern syntax came about; the ''Aṣṭādhyāyī'' of [[Pāṇini]] is often cited as an example of a pre-modern work that approaches the sophistication of a modern syntactic theory. In the West, the school of thought that came to be known as "traditional grammar" began with the work of [[Dionysius Thrax]].
For centuries, work in syntax was dominated by a framework known as {{lang|fr|''grammaire générale''}}, first expounded in 1660 by [[Antoine Arnauld]] in a book of the same title. This system took as its basic premise the assumption that language is a direct reflection of thought processes and therefore there is a single, most natural way to express a thought. That way, coincidentally, was exactly the way it was expressed in French.
However, in the 19th century, with the development of [[historical-comparative linguistics]], linguists began to realize the sheer diversity of human language, and to question fundamental assumptions about the relationship between language and logic. It became apparent that there was no such thing as a most natural way to express a thought, and therefore logic could no longer be relied upon as a basis for studying the structure of language.
The Port-Royal grammar modeled the study of syntax upon that of logic (indeed, large parts of the [[Port-Royal Logic]] were copied or adapted from the ''Grammaire générale''). Syntactic categories were identified with logical ones, and all sentences were analyzed in terms of "Subject – Copula – Predicate". Initially, this view was adopted even by the early comparative linguists such as [[Franz Bopp]].
The central role of syntax within theoretical linguistics became clear only in the 20th century, which could reasonably be called the "century of syntactic theory" as far as linguistics is concerned. For a detailed and critical survey of the history of syntax in the last two centuries, see the monumental work by Graffi (2001).
==Modern theories==
There are a number of theoretical approaches to the discipline of syntax. Many linguists (e.g. [[Noam Chomsky]]) see syntax as a branch of biology, since they conceive of syntax as the study of linguistic knowledge as embodied in the human [[mind]]. Others (e.g. [[Gerald Gazdar]]) take a more [[Philosophy of mathematics#Platonism|Platonistic]] view, since they regard syntax as the study of an abstract [[formal system]]. Yet others (e.g. [[Joseph Greenberg]]) consider grammar a taxonomical device to reach broad generalizations across languages. Some of the major approaches to the discipline are listed below.
===Generative grammar===
The hypothesis of [[generative grammar]] is that language is a structure of the human mind. The goal of generative grammar is to make a complete model of this inner language (known as ''[[i-language]]''). This model could be used to describe all human language and to predict the [[grammaticality]] of any given utterance (that is, to predict whether the utterance would sound correct to native speakers of the language). This approach to language was pioneered by [[Noam Chomsky]]. Most generative theories (although not all of them) assume that syntax is based upon the constituent structure of sentences. Generative grammars are among the theories that focus primarily on the form of a sentence, rather than its communicative function.
Among the many generative theories of linguistics are:
*[[Transformational Grammar]] (TG) (now largely out of date)
*[[Government and binding theory]] (GB) (common in the late 1970s and 1980s)
*[[Linguistic minimalism|Minimalism]] (MP) (the most recent Chomskyan version of generative grammar)
Other theories that find their origin in the generative paradigm are:
*[[Generative semantics]] (now largely out of date)
*[[Relational grammar]] (RG) (now largely out of date)
*[[Arc Pair grammar]]
*[[Generalised phrase structure grammar|Generalized phrase structure grammar]] (GPSG; now largely out of date)
*[[Head-driven phrase structure grammar]] (HPSG)
*[[Lexical-functional grammar]] (LFG)
===Categorial grammar ===
[[Categorial grammar]] is an approach that attributes the syntactic structure not to rules of grammar, but to the properties of the [[syntactic categories]] themselves. For example, rather than asserting that sentences are constructed by a rule that combines a noun phrase (NP) and a verb phrase (VP) (e.g. the [[phrase structure rule]] S → NP VP), in categorial grammar, such principles are embedded in the category of the [[head (linguistics)|head]] word itself. So the syntactic category for an [[intransitive]] verb is a complex formula representing the fact that the verb acts as a [[functor]] which requires an NP as an input and produces a sentence level structure as an output. This complex category is notated as (NP\S) instead of V. NP\S is read as "a category that searches to the left (indicated by \) for an NP (the element on the left) and outputs a sentence (the element on the right)". The category of [[transitive verb]] is defined as an element that requires two NPs (its subject and its direct object) to form a sentence. This is notated as (NP/(NP\S)), which means "a category that searches to the right (indicated by /) for an NP (the object), and generates a function (equivalent to the VP) which is (NP\S), which in turn represents a function that searches to the left for an NP and produces a sentence". [[Tree-adjoining grammar]] is a categorial grammar that adds in partial [[tree structure]]s to the categories.
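The way categories combine can be sketched in a few lines of Python. The representation below, in which each complex category records the argument it seeks, the direction in which it searches and the result it produces, is an assumption made for illustration, as are the class name <code>Functor</code>, the function <code>combine</code> and the four-word lexicon; it is not a standard notation or a complete parser.

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Functor:
    """A complex category: seeks `argument` in `direction`, yields `result`."""
    argument: "Category"
    direction: str  # "left" or "right"
    result: "Category"

# A category is either an atomic label such as "NP" or "S", or a Functor.
Category = Union[str, Functor]

def combine(left, right):
    """Combine two adjacent categories by function application, or return None."""
    # The left item searches to the right for its argument.
    if isinstance(left, Functor) and left.direction == "right" and left.argument == right:
        return left.result
    # The right item searches to the left for its argument.
    if isinstance(right, Functor) and right.direction == "left" and right.argument == left:
        return right.result
    return None

# Intransitive verb: searches left for an NP and produces a sentence.
INTRANSITIVE = Functor("NP", "left", "S")
# Transitive verb: searches right for an NP and produces an intransitive category.
TRANSITIVE = Functor("NP", "right", INTRANSITIVE)

# Hypothetical toy lexicon.
lexicon = {"Mary": "NP", "John": "NP", "sleeps": INTRANSITIVE, "sees": TRANSITIVE}

# "Mary sees John": the verb first takes its object, then its subject.
vp = combine(lexicon["sees"], lexicon["John"])
sentence = combine(lexicon["Mary"], vp)
print(sentence)  # S
```

No phrase structure rule appears anywhere in the sketch: the only general operation is function application, and everything else follows from the categories assigned to the words.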
===Dependency grammar===
[[Dependency grammar]] is a different type of approach in which structure is determined by the [[relation]]s (such as [[grammatical relation]]s) between a word (a ''[[head (linguistics)|head]]'') and its dependents, rather than being based in constituent structure. For example, syntactic structure is described in terms of whether a particular [[noun]] is the [[subject]] or [[agent]] of the [[verb]], rather than describing the relations in terms of trees (one version of which is the [[parse tree]]) or other structural system.
Some dependency-based theories of syntax:
*[[Algebraic syntax]]
*[[Word grammar]]
*[[Operator Grammar]]
===Stochastic/probabilistic grammars/network theories ===
Theoretical approaches to syntax that are based upon [[probability theory]] are known as [[stochastic grammar]]s. One common implementation of such an approach makes use of a [[neural network]] or [[connectionism]]. Some theories based within this approach are:
*[[Optimality theory]]
*[[Stochastic context-free grammar]]
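As an illustration of how probability enters such a grammar, the sketch below assigns a probability to each rule of a toy stochastic context-free grammar and computes the probability of one parse tree as the product of the probabilities of the rules it uses. The grammar, its probabilities and the function name <code>tree_probability</code> are hypothetical and serve only as an example.

```python
# Hypothetical toy grammar: (left-hand side, right-hand side) -> probability.
# The probabilities of all rules sharing a left-hand side sum to 1.
RULES = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("Mary",)): 0.5,
    ("NP", ("John",)): 0.5,
    ("VP", ("V", "NP")): 0.7,
    ("VP", ("V",)): 0.3,
    ("V", ("sees",)): 0.6,
    ("V", ("sleeps",)): 0.4,
}

def tree_probability(tree):
    """Probability of a parse tree given as nested tuples (label, child, ...)."""
    if isinstance(tree, str):
        return 1.0  # a word (leaf) contributes no further rule
    label, *children = tree
    # The rule used at this node is identified by the labels of its children.
    child_labels = tuple(c if isinstance(c, str) else c[0] for c in children)
    prob = RULES[(label, child_labels)]
    for child in children:
        prob *= tree_probability(child)
    return prob

# Parse tree for "Mary sees John".
tree = ("S", ("NP", "Mary"), ("VP", ("V", "sees"), ("NP", "John")))
print(tree_probability(tree))
```

When a sentence has several possible parse trees, a grammar of this kind can rank them by probability instead of merely accepting or rejecting the sentence.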
===Functionalist grammars===
Functionalist theories, although focused upon form, are driven by explanation based upon the function of a sentence (i.e. its communicative function). Some typical functionalist theories include:
*[[Functional grammar]] (Dik)
*[[Prague Linguistic Circle]]
*[[Systemic functional grammar]]
*[[Cognitive grammar]]
*[[Construction grammar]] (CxG)
*[[Role and reference grammar]] (RRG)
SYSTRAN
'''SYSTRAN''', founded by Dr. [[Peter Toma]] in [[1968]], is one of the oldest [[machine translation]] companies. SYSTRAN has done extensive work for the [[United States Department of Defense]] and the [[European Commission]].
SYSTRAN provides the technology for [[Yahoo!]] and [[AltaVista]]'s [[Babel Fish (website)|Babel Fish]], among others. [[Google]]'s [[List of Google products#anchor_language_tools|language tools]] stopped using it (circa 2007) for all of the language combinations they offer.
Commercial versions of SYSTRAN run on the [[Microsoft Windows]] (including [[Windows Mobile]]), [[Linux]] and [[Solaris (operating system)|Solaris]] operating systems.
== History ==
With its origin in the [[Georgetown-IBM experiment|Georgetown]] machine translation effort, SYSTRAN was one of the few machine translation systems to survive the major decrease of funding after the [[ALPAC|ALPAC Report]] of the mid-1960s. The company was established in [[La Jolla, San Diego, California|La Jolla]], [[California]] to work on translation of Russian to English text for the [[United States Air Force]] during the [[Cold War]]. Large numbers of Russian scientific and technical documents were translated using SYSTRAN under the auspices of the USAF Foreign Technology Division (later the National Air and Space Intelligence Center) at [[Wright-Patterson Air Force Base]], Ohio. The quality of the translations, although only approximate, was usually adequate for understanding content.
The company was sold during 1986 to the Gachot family, based in [[Paris]], [[France]], and is now publicly traded on the French stock exchange. It has a main office at the [[Grande Arche]] in [[La Defense]] and maintains a secondary office in [[La Jolla, San Diego, California]].
== Languages ==
Here is a list of the source and target languages SYSTRAN works with.
Many of the pairs are to or from English or French.
* Russian into English (1968)
* English into Russian (1973) for the [[Apollo-Soyuz]] project
* English source (1975) for the [[European Commission]]
* Arabic
* Chinese
* Danish
* Dutch
* French
* German
* Greek
* Hindi
* Italian
* Japanese
* Korean
* Norwegian
* Serbo-Croatian
* Spanish
* Swedish
* Persian
* Polish
* Portuguese
* Ukrainian
* Urdu
Text analytics
The term '''text analytics''' describes a set of linguistic, lexical, pattern recognition, extraction, tagging/structuring, visualization, and predictive techniques. The term also describes processes that apply these techniques, whether independently or in conjunction with query and analysis of fielded, numerical data, to solve business problems. These techniques and processes discover and present knowledge – facts, business rules, and relationships – that is otherwise locked in textual form, impenetrable to automated processing.
A typical application is to scan a set of documents written in a [[natural language]] and either model the document set for predictive classification purposes or populate a database or search index with the information extracted. Current approaches to text analytics use [[natural language processing]] techniques that focus on specialized domains.
Typical subtasks are:
* [[Named Entity Recognition]]: recognition of entity names (for people and organizations), place names, temporal expressions, and certain types of numerical expressions.
* [[Coreference]]: identification of chains of [[noun phrase]]s that refer to the same object. For example, [[Anaphora (linguistics)|anaphora]] is a type of coreference.
* [[Relationship Extraction]]: extraction of named relationships between entities in text