Algorithm
In mathematics, computing, linguistics and related disciplines, an algorithm is a sequence of instructions, often used for calculation and data processing.
It is formally a type of effective method in which a list of well-defined instructions for completing a task will, when given an initial state, proceed through a well-defined series of successive states, eventually terminating in an end-state.
The transition from one state to the next is not necessarily deterministic; some algorithms, known as probabilistic algorithms, incorporate randomness.
A partial formalization of the concept began with attempts to solve the Entscheidungsproblem (the "decision problem") posed by David Hilbert in 1928.
Subsequent formalizations were framed as attempts to define "effective calculability" (Kleene 1943:274) or "effective method" (Rosser 1939:225); those formalizations included the Gödel-Herbrand-Kleene recursive functions of 1930, 1934 and 1935, Alonzo Church's lambda calculus of 1936, Emil Post's "Formulation I" of 1936, and Alan Turing's Turing machines of 1936-7 and 1939.
Etymology
Al-Khwārizmī, Persian astronomer and mathematician, wrote a treatise in Arabic in 825 AD, On Calculation with Hindu Numerals.
(See algorism).
It was translated into Latin in the 12th century as Algoritmi de numero Indorum (al-Daffa 1977), which title was likely intended to mean "Algoritmi on the numbers of the Indians", where "Algoritmi" was the translator's rendition of the author's name; but people misunderstanding the title treated Algoritmi as a Latin plural and this led to the word "algorithm" (Latin algorismus) coming to mean "calculation method".
The intrusive "th" is most likely due to a false cognate with the Greek ἀριθμός (arithmos) meaning "number".
Why algorithms are necessary: an informal definition
No generally accepted formal definition of "algorithm" exists yet.
An informal definition could be "an algorithm is a computer program that calculates something."
For some people, a program is only an algorithm if it stops eventually.
For others, a program is only an algorithm if it stops before a given number of calculation steps.
A prototypical example of an "algorithm" is Euclid's algorithm to determine the greatest common divisor of two integers greater than one: "subtract the smaller number from the larger one, repeat until you get a zero or a one".
This procedure is known always to stop, and the number of subtractions needed is always smaller than the larger of the two numbers.
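The subtraction procedure can be sketched in Python as follows. This is a minimal illustration rather than a canonical formulation: the function name is hypothetical, and the loop is assumed to stop when the two numbers become equal (one step before a zero would appear), at which point the common value is the greatest common divisor. The count of subtractions is returned so that the bound stated above can be checked.

```python
def gcd_by_subtraction(a: int, b: int) -> tuple[int, int]:
    """Return (gcd, number_of_subtractions) for positive integers a and b."""
    if a < 1 or b < 1:
        raise ValueError("both inputs must be positive integers")
    subtractions = 0
    # Repeatedly subtract the smaller number from the larger one.
    while a != b:
        if a > b:
            a -= b
        else:
            b -= a
        subtractions += 1
    # When both numbers are equal, that value divides both original inputs.
    return a, subtractions


print(gcd_by_subtraction(48, 18))  # (6, 4)
```

For the inputs 48 and 18 the loop performs four subtractions, well below the larger input.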
We can derive clues to the issues involved and an informal meaning of the word from the following quotation from Boolos and Jeffrey (boldface added):
No human being can write fast enough, or long enough, or small enough to list all members of an enumerably infinite set by writing out their names, one after another, in some notation.
But humans can do something equally useful, in the case of certain enumerably infinite sets: They can give explicit instructions for determining the nth member of the set, for arbitrary finite n.
Such instructions are to be given quite explicitly, in a form in which they could be followed by a computing machine, or by a human who is capable of carrying out only very elementary operations on symbols.
The words "enumerably infinite" mean "countable using integers perhaps extending to infinity".
Thus Boolos and Jeffrey are saying that an algorithm implies instructions for a process that "creates" output integers from an arbitrary "input" integer or integers that, in theory, can be chosen from 0 to infinity.
Thus we might expect an algorithm to be an algebraic equation such as y = m + n — two arbitrary "input variables" m and n that produce an output y.
As discussed in Algorithm characterizations, the word algorithm implies much more than this, something on the order of (for our addition example):
Precise instructions (in language understood by "the computer") for a "fast, efficient, good" process that specifies the "moves" of "the computer" (machine or human, equipped with the necessary internally-contained information and capabilities) to find, decode, and then munch arbitrary input integers/symbols m and n, symbols + and = ... and (reliably, correctly, "effectively") produce, in a "reasonable" time, output-integer y at a specified place and in a specified format.
The concept of algorithm is also used to define the notion of decidability.
That notion is central for explaining how formal systems come into being starting from a small set of axioms and rules.
In logic, the time that an algorithm requires to complete cannot be measured, as it is not apparently related to our customary physical dimension.
These uncertainties, which characterize ongoing work, explain why no available definition of algorithm suits both concrete (in some sense) and abstract usage of the term.
For a detailed presentation of the various points of view around the definition of "algorithm" see Algorithm characterizations.
For examples of simple addition algorithms specified in the detailed manner described in Algorithm characterizations, see Algorithm examples.
Formalization of algorithms
Algorithms are essential to the way computers process information, because a computer program is essentially an algorithm that tells the computer what specific steps to perform (in what specific order) in order to carry out a specified task, such as calculating employees’ paychecks or printing students’ report cards.
Thus, an algorithm can be considered to be any sequence of operations that can be performed by a Turing-complete system.
Authors who assert this thesis include Savage (1987) and Gurevich (2000):
...Turing's informal argument in favor of his thesis justifies a stronger thesis: every algorithm can be simulated by a Turing machine (Gurevich 2000:1)...according to Savage [1987], an algorithm is a computational process defined by a Turing machine.
(Gurevich 2000:3)
Typically, when an algorithm is associated with processing information, data are read from an input source or device, written to an output sink or device, and/or stored for further processing.
Stored data are regarded as part of the internal state of the entity performing the algorithm.
In practice, the state is stored in a data structure, but an algorithm requires the internal data only for specific operation sets called abstract data types.
For any such computational process, the algorithm must be rigorously defined: specified in the way it applies in all possible circumstances that could arise.
That is, any conditional steps must be systematically dealt with, case-by-case; the criteria for each case must be clear (and computable).
Because an algorithm is a precise list of precise steps, the order of computation will almost always be critical to the functioning of the algorithm.
Instructions are usually assumed to be listed explicitly, and are described as starting "from the top" and going "down to the bottom", an idea that is described more formally by flow of control.
So far, this discussion of the formalization of an algorithm has assumed the premises of imperative programming.
This is the most common conception, and it attempts to describe a task in discrete, "mechanical" means.
Unique to this conception of formalized algorithms is the assignment operation, setting the value of a variable.
It derives from the intuition of "memory" as a scratchpad.
There is an example below of such an assignment.
For some alternative conceptions of what constitutes an algorithm, see functional programming and logic programming.
Termination
Some writers restrict the definition of algorithm to procedures that eventually finish.
In such a category Kleene places the "decision procedure or decision method or algorithm for the question" (Kleene 1952:136).
Others, including Kleene, include procedures that could run forever without stopping; such a procedure has been called a "computational method" (Knuth 1997:5) or "calculation procedure or algorithm" (Kleene 1952:137); however, Kleene notes that such a method must eventually exhibit "some object" (Kleene 1952:137).
Minsky makes the pertinent observation, with regard to determining whether an algorithm will eventually terminate (from a particular starting state):
But if the length of the process is not known in advance, then "trying" it may not be decisive, because if the process does go on forever — then at no time will we ever be sure of the answer (Minsky 1967:105).
As it happens, no other method can do any better, as was shown by Alan Turing with his celebrated result on the undecidability of the so-called halting problem.
There is no algorithmic procedure for determining of arbitrary algorithms whether or not they terminate from given starting states.
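Minsky's point about "trying" a process can be illustrated with a small sketch. It assumes, purely for illustration, that a process is modelled as a Python generator that yields once per step; the names halts_within, countdown and forever are hypothetical. Running a process under a step budget can confirm termination, but an exhausted budget says nothing either way.

```python
from typing import Callable, Iterator, Optional


def halts_within(process: Callable[[], Iterator[None]], max_steps: int) -> Optional[bool]:
    """Run a process for at most max_steps steps.

    Returns True if the process finished within the budget, and None
    (undecided) otherwise. It can never return False: no finite budget
    proves that a process runs forever.
    """
    steps = process()
    for _ in range(max_steps):
        try:
            next(steps)
        except StopIteration:
            return True
    return None


def countdown(n: int) -> Callable[[], Iterator[None]]:
    """A process that takes n steps and then stops."""
    def run() -> Iterator[None]:
        remaining = n
        while remaining > 0:
            remaining -= 1
            yield
    return run


def forever() -> Iterator[None]:
    """A process that never stops."""
    while True:
        yield


print(halts_within(countdown(3), 10))  # True
print(halts_within(countdown(3), 3))   # None: budget too small to tell
print(halts_within(forever, 1000))     # None: indistinguishable from a slow process
```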
The analysis of algorithms for their likelihood of termination is called termination analysis.
See the examples of (im-)"proper" subtraction at partial function for more about what can happen when an algorithm fails for certain of its input numbers — e.g., (i) non-termination, (ii) production of "junk" (output in the wrong format to be considered a number) or no number(s) at all (halt ends the computation with no output), (iii) wrong number(s), or (iv) a combination of these.
Kleene proposed that the production of "junk" or failure to produce a number is solved by having the algorithm detect these instances and produce e.g., an error message (he suggested "0"), or preferably, force the algorithm into an endless loop (Kleene 1952:322).
Davis does this to his subtraction algorithm — he fixes his algorithm in a second example so that it is proper subtraction (Davis 1958:12-15).
Along with the logical outcomes "true" and "false" Kleene also proposes the use of a third logical symbol "u" — undecided (Kleene 1952:326) — thus an algorithm will always produce something when confronted with a "proposition".
The problem of wrong answers must be solved with an independent "proof" of the algorithm e.g., using induction:
We normally require auxiliary evidence for this (that the algorithm correctly defines a mu recursive function), e.g., in the form of an inductive proof that, for each argument value, the computation terminates with a unique value (Minsky 1967:186).
Expressing algorithms
Algorithms can be expressed in many kinds of notation, including natural languages, pseudocode, flowcharts, and programming languages.
Natural language expressions of algorithms tend to be verbose and ambiguous, and are rarely used for complex or technical algorithms.
Pseudocode and flowcharts are structured ways to express algorithms that avoid many of the ambiguities common in natural language statements, while remaining independent of a particular implementation language.
Programming languages are primarily intended for expressing algorithms in a form that can be executed by a computer, but are often used as a way to define or document algorithms.
There is a wide variety of representations possible and one can express a given Turing machine program as a sequence of machine tables (see more at finite state machine and state transition table), as flowcharts (see more at state diagram), or as a form of rudimentary machine code or assembly code called "sets of quadruples" (see more at Turing machine).
Sometimes it is helpful in the description of an algorithm to supplement small "flow charts" (state diagrams) with natural-language and/or arithmetic expressions written inside "block diagrams" to summarize what the "flow charts" are accomplishing.
Representations of algorithms are generally classed into three accepted levels of Turing machine description (Sipser 2006:157):
1 High-level description:
"...prose to describe an algorithm, ignoring the implementation details.
At this level we do not need to mention how the machine manages its tape or head"
2 Implementation description:
"...prose used to define the way the Turing machine uses its head and the way that it stores data on its tape.
At this level we do not give details of states or transition function"
3 Formal description:
Most detailed, "lowest level", gives the Turing machine's "state table".
For an example of the simple algorithm "Add m+n" described in all three levels see Algorithm examples.
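As a rough illustration of the formal level, the sketch below simulates a Turing machine whose state table adds two numbers written in unary. The encoding (strokes "1" separated by "+", blank written as "_"), the state names and the simulator itself are assumptions made for this example and are not taken from Algorithm examples.

```python
def run_turing_machine(table, tape, state="q0", halt="halt", blank="_", max_steps=10_000):
    """Simulate a one-tape Turing machine and return the non-blank tape contents.

    table maps (state, symbol) to (symbol_to_write, head_move, next_state),
    where head_move is "L", "R" or "N" (no move).
    """
    cells = dict(enumerate(tape))
    head = 0
    steps = 0
    while state != halt:
        if steps >= max_steps:
            raise RuntimeError("step budget exhausted")
        symbol = cells.get(head, blank)
        write, move, state = table[(state, symbol)]
        cells[head] = write
        head += {"L": -1, "R": 1, "N": 0}[move]
        steps += 1
    used = [i for i, s in cells.items() if s != blank]
    if not used:
        return ""
    return "".join(cells.get(i, blank) for i in range(min(used), max(used) + 1))


# State table for unary addition: turn "+" into a stroke, then erase the last stroke.
ADD = {
    ("q0", "1"): ("1", "R", "q0"),    # skip over the first number
    ("q0", "+"): ("1", "R", "q1"),    # replace the separator with a stroke
    ("q1", "1"): ("1", "R", "q1"),    # skip over the second number
    ("q1", "_"): ("_", "L", "q2"),    # reached the end; step back
    ("q2", "1"): ("_", "N", "halt"),  # erase one surplus stroke and halt
}

print(run_turing_machine(ADD, "111+11"))  # 11111
```

The dictionary ADD plays the role of the state table; the prose comments beside each row correspond to an implementation description, and the sentence "join the two numbers and remove one stroke" to a high-level description.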
Implementation
Most algorithms are intended to be implemented as computer programs.
However, algorithms are also implemented by other means, such as in a biological neural network (for example, the human brain implementing arithmetic or an insect looking for food), in an electrical circuit, or in a mechanical device.
Example
One of the simplest algorithms is to find the largest number in an (unsorted) list of numbers.
The solution necessarily requires looking at every number in the list, but only once at each.
From this follows a simple algorithm, which can be stated as a high-level description in English prose:
High-level description:
Assume the first item is largest.
Look at each of the remaining items in the list and if it is larger than the largest item so far, make a note of it.
The last noted item is the largest in the list when the process is complete.
(Quasi-)formal description: Written in prose but much closer to the high-level language of a computer program, the following is the more formal coding of the algorithm in pseudocode or pidgin code:
Input: A non-empty list of numbers L.
Output: The largest number in the list L.
  largest ← L0
  for each item in the list L≥1, do
    if the item > largest, then
      largest ← the item
  return largest
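The pseudocode translates almost line for line into a programming language. The following Python version is one possible rendering; the function name find_largest is chosen here for illustration.

```python
from typing import Sequence


def find_largest(numbers: Sequence[float]) -> float:
    """Return the largest number in a non-empty sequence."""
    if not numbers:
        raise ValueError("the list must be non-empty")
    largest = numbers[0]          # assume the first item is largest
    for item in numbers[1:]:      # look at each remaining item once
        if item > largest:
            largest = item        # make a note of the new largest item
    return largest


print(find_largest([3, 41, 12, 9, 74, 15]))  # 74
```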
For a more complex example of an algorithm, see Euclid's algorithm for the greatest common divisor, one of the earliest algorithms known.
Algorithm analysis
As it happens, it is important to know how much of a particular resource (such as time or storage) is required for a given algorithm.
Methods have been developed for the analysis of algorithms to obtain such quantitative answers; for example, the algorithm above has a time requirement of O(n), using the big O notation with n as the length of the list.
At all times the algorithm only needs to remember two values: the largest number found so far, and its current position in the input list.
Therefore it is said to have a space requirement of O(1), if the space required to store the input numbers is not counted, or O(n) if it is counted.
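The linear time requirement can be made concrete by counting comparisons. The sketch below is an instrumented variant of the largest-number procedure (the name largest_with_cost is hypothetical); besides the input list it keeps only the largest value so far, the current position and the counter itself.

```python
def largest_with_cost(numbers):
    """Return (largest, comparisons) for a non-empty list of numbers."""
    largest = numbers[0]
    comparisons = 0
    for position in range(1, len(numbers)):
        comparisons += 1                 # one comparison per remaining item
        if numbers[position] > largest:
            largest = numbers[position]
    return largest, comparisons


for n in (10, 100, 1000):
    _, cost = largest_with_cost(list(range(n)))
    print(n, cost)  # the cost is always n - 1
```

Multiplying the length of the list by ten multiplies the number of comparisons by roughly ten, which is what O(n) expresses.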
Different algorithms may complete the same task with a different set of instructions in less or more time, space, or effort than others.
For example, given two different recipes for making potato salad, one may have "peel the potato" before "boil the potato" while the other presents the steps in the reverse order, yet they both call for these steps to be repeated for all potatoes and end when the potato salad is ready to be eaten.
The analysis and study of algorithms is a discipline of computer science, and is often practiced abstractly without the use of a specific programming language or implementation.
In this sense, algorithm analysis resembles other mathematical disciplines in that it focuses on the underlying properties of the algorithm and not on the specifics of any particular implementation.
Usually pseudocode is used for analysis as it is the simplest and most general representation.
Classes
There are various ways to classify algorithms, each with its own merits.
Classification by implementation
One way to classify algorithms is by implementation means.
Recursion or iteration: A recursive algorithm is one that invokes (makes reference to) itself repeatedly until a certain condition matches, which is a method common to functional programming.
Iterative algorithms use repetitive constructs like loops and sometimes additional data structures like stacks to solve the given problems.
Some problems are naturally suited for one implementation or the other.
For example, the Towers of Hanoi puzzle is well understood in its recursive implementation.
Every recursive version has an equivalent (but possibly more or less complex) iterative version, and vice versa.
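The equivalence of the two styles can be sketched with the Towers of Hanoi. The recursive function below calls itself on smaller stacks of disks; the iterative one produces the same moves using a loop and an explicit stack of pending tasks. Both function names and the task-tuple layout are illustrative choices.

```python
def hanoi_recursive(n, source, target, spare):
    """Return the list of moves (from_peg, to_peg) that transfers n disks."""
    if n == 0:
        return []
    return (hanoi_recursive(n - 1, source, spare, target)
            + [(source, target)]
            + hanoi_recursive(n - 1, spare, target, source))


def hanoi_iterative(n, source, target, spare):
    """Same moves as hanoi_recursive, using a loop and an explicit stack."""
    moves = []
    stack = [("solve", n, source, target, spare)]
    while stack:
        kind, k, a, b, c = stack.pop()
        if kind == "move":
            moves.append((a, b))
        elif k > 0:
            # Pushed in reverse so that the tasks are popped in recursive order.
            stack.append(("solve", k - 1, c, b, a))
            stack.append(("move", 0, a, b, c))
            stack.append(("solve", k - 1, a, c, b))
    return moves


print(hanoi_recursive(2, "A", "C", "B"))  # [('A', 'B'), ('A', 'C'), ('B', 'C')]
print(hanoi_iterative(2, "A", "C", "B"))  # [('A', 'B'), ('A', 'C'), ('B', 'C')]
```

Here the iterative version is the more complex of the two, since it has to manage by hand the bookkeeping that the recursive version leaves to the call stack.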
Logical: An algorithm may be viewed as controlled logical deduction.
This notion may be expressed as: Algorithm = logic + control (Kowalski 1979).
The logic component expresses the axioms that may be used in the computation and the control component determines the way in which deduction is applied to the axioms.
This is the basis for the logic programming paradigm.
In pure logic programming languages the control component is fixed and algorithms are specified by supplying only the logic component.
The appeal of this approach is the elegant semantics: a change in the axioms has a well defined change in the algorithm.
Serial or parallel or distributed: Algorithms are usually discussed with the assumption that computers execute one instruction of an algorithm at a time.
Those computers are sometimes called serial computers.
An algorithm designed for such an environment is called a serial algorithm, as opposed to parallel algorithms or distributed algorithms.
Parallel algorithms take advantage of computer architectures where several processors can work on a problem at the same time, whereas distributed algorithms utilize multiple machines connected with a network.
Parallel or distributed algorithms divide the problem into subproblems, which may be symmetrical or asymmetrical, and collect the results back together.
The resource consumption in such algorithms is not only processor cycles on each processor but also the communication overhead between the processors.
Sorting algorithms can be parallelized efficiently, but their communication overhead is expensive.
Iterative algorithms are generally parallelizable.
Some problems have no parallel algorithms, and are called inherently serial problems.
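The divide-and-collect pattern of a parallel algorithm can be sketched with Python's standard concurrent.futures module. The example finds the largest number in a list by giving each worker one chunk and then combining the partial results; the function name parallel_max and the chunking scheme are assumptions for illustration. In CPython, threads do not usually speed up pure computation of this kind, so the sketch shows the structure rather than a performance gain.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_max(values, workers=4):
    """Find the largest value by splitting the list among several workers."""
    if not values:
        raise ValueError("the list must be non-empty")
    # Divide: cut the input into at most `workers` chunks of near-equal size.
    chunk_size = -(-len(values) // workers)  # ceiling division
    chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]
    # Solve the subproblems, potentially at the same time.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_results = list(pool.map(max, chunks))
    # Collect: combine the partial results into the final answer.
    return max(partial_results)


print(parallel_max([3, 41, 12, 9, 74, 15, 8, 60]))  # 74
```

Passing the chunks to the workers and gathering the partial results corresponds to the communication overhead mentioned above.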
Deterministic or non-deterministic: Deterministic algorithms solve the problem with an exact decision at every step of the algorithm, whereas non-deterministic algorithms solve problems via guessing, although typical guesses are made more accurate through the use of heuristics.
Exact or approximate: While many algorithms reach an exact solution, approximation algorithms seek an approximation that is close to the true solution.
Approximation may use either a deterministic or a random strategy.
Such algorithms have practical value for many hard problems.
Classification by design paradigm
Another way of classifying algorithms is by their design methodology or paradigm.
There is a certain number of paradigms, each different from the other.
Furthermore, each of these categories will include many different types of algorithms.
Some commonly found paradigms include:
Divide and conquer.
A divide and conquer algorithm repeatedly reduces an instance of a problem to one or more smaller instances of the same problem (usually recursively), until the instances are small enough to solve easily.
One such example of divide and conquer is merge sort.
The data are divided into segments, each segment is sorted, and the sorted whole is obtained in the conquer phase by merging the segments.
A simpler variant of divide and conquer is the decrease and conquer algorithm, which solves an identical subproblem and uses the solution of this subproblem to solve the bigger problem.
Divide and conquer divides the problem into multiple subproblems, so its conquer stage is more complex than that of decrease and conquer algorithms.
An example of a decrease and conquer algorithm is the binary search algorithm.
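Both variants can be sketched briefly in Python. In the illustrative code below, merge_sort splits the list into two halves, sorts each recursively and merges them, while binary_search discards half of a sorted list at every step and so only ever pursues a single subproblem.

```python
def merge_sort(items):
    """Divide and conquer: sort two halves, then merge them."""
    if len(items) <= 1:
        return list(items)
    middle = len(items) // 2
    left = merge_sort(items[:middle])
    right = merge_sort(items[middle:])
    # Conquer phase: merge two sorted lists into one.
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


def binary_search(sorted_items, target):
    """Decrease and conquer: return an index of target, or None if absent."""
    low, high = 0, len(sorted_items) - 1
    while low <= high:
        middle = (low + high) // 2
        if sorted_items[middle] == target:
            return middle
        if sorted_items[middle] < target:
            low = middle + 1    # continue in the upper half only
        else:
            high = middle - 1   # continue in the lower half only
    return None


data = merge_sort([5, 2, 9, 1, 5, 6])
print(data)                    # [1, 2, 5, 5, 6, 9]
print(binary_search(data, 6))  # 4
```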
Dynamic programming.
When a problem shows optimal substructure, meaning the optimal solution to a problem can be constructed from optimal solutions to subproblems, and overlapping subproblems, meaning the same subproblems are used to solve many different problem instances, a quicker approach called dynamic programming avoids recomputing solutions that have already been computed.
For example, the shortest path to a goal from a vertex in a weighted graph can be found by using the shortest path to the goal from all adjacent vertices.
Dynamic programming and memoization go together.
The main difference between dynamic programming and divide and conquer is that subproblems are more or less independent in divide and conquer, whereas subproblems overlap in dynamic programming.
The difference between dynamic programming and straightforward recursion is in caching or memoization of recursive calls.
When subproblems are independent and there is no repetition, memoization does not help; hence dynamic programming is not a solution for all complex problems.
By using memoization or maintaining a table of subproblems already solved, dynamic programming reduces the exponential nature of many problems to polynomial complexity.
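The effect of memoization can be seen on a standard textbook case, the Fibonacci numbers, chosen here only as a small example of overlapping subproblems. Both functions below follow the same recursion; the second keeps a table of subproblems already solved. A one-element list is used as a hypothetical counter of how many subproblems are actually computed.

```python
def fib_naive(n, counter):
    """Straightforward recursion: the same subproblems are recomputed many times."""
    counter[0] += 1
    if n < 2:
        return n
    return fib_naive(n - 1, counter) + fib_naive(n - 2, counter)


def fib_memo(n, counter, table=None):
    """Same recursion, but each subproblem is computed once and then looked up."""
    if table is None:
        table = {}
    if n in table:
        return table[n]
    counter[0] += 1
    result = n if n < 2 else fib_memo(n - 1, counter, table) + fib_memo(n - 2, counter, table)
    table[n] = result
    return result


naive_calls, memo_calls = [0], [0]
print(fib_naive(20, naive_calls), naive_calls[0])  # 6765 21891
print(fib_memo(20, memo_calls), memo_calls[0])     # 6765 21
```

The naive version performs a number of calls that grows exponentially with n, whereas the memoized version computes each of the n + 1 subproblems exactly once.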
The greedy method.
A greedy algorithm is similar to a dynamic programming algorithm, but the difference is that solutions to the subproblems do not have to be known at each stage; instead a "greedy" choice can be made of what looks best for the moment.
The greedy method extends the solution with the best possible decision (not all feasible decisions) at an algorithmic stage based on the current local optimum and the best decision (not all possible decisions) made in previous stage.
It is not exhaustive, and does not give an accurate answer to many problems.
But when it works, it is typically the fastest method.
A well-known greedy algorithm is Kruskal's algorithm for finding a minimal spanning tree.
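A compact sketch of Kruskal's method shows the greedy choice at work: edges are considered in order of increasing weight, and each edge is accepted if it joins two components that are not yet connected. The representation of edges as (weight, u, v) tuples and the simple union-find helper are assumptions made for this example.

```python
def kruskal(num_vertices, edges):
    """Return the edges of a minimum spanning tree.

    edges is a list of (weight, u, v) tuples with vertices numbered
    from 0 to num_vertices - 1.
    """
    parent = list(range(num_vertices))

    def find(v):
        # Follow parent links to the representative of v's component.
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving keeps the trees shallow
            v = parent[v]
        return v

    tree = []
    for weight, u, v in sorted(edges):   # greedy: cheapest edge first
        root_u, root_v = find(u), find(v)
        if root_u != root_v:             # accept only if it joins two components
            parent[root_u] = root_v
            tree.append((weight, u, v))
    return tree


edges = [(1, 0, 1), (4, 0, 2), (3, 1, 2), (2, 1, 3), (5, 2, 3)]
tree = kruskal(4, edges)
print(tree)                        # [(1, 0, 1), (2, 1, 3), (3, 1, 2)]
print(sum(w for w, _, _ in tree))  # 6
```

Once an edge has been accepted it is never reconsidered, which is what distinguishes the greedy method from an exhaustive search.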
Linear programming.
When solving a problem using linear programming, specific inequalities involving the inputs are found and then an attempt is made to maximize (or minimize) some linear function of the inputs.
Many problems (such as the maximum flow for directed graphs) can be stated in a linear programming way, and then be solved by a 'generic' algorithm such as the simplex algorithm.
A more complex variant of linear programming is called integer programming, where the solution space is restricted to the integers.
Reduction.
This technique involves solving a difficult problem by transforming it into a better known problem for which we have (hopefully) asymptotically optimal algorithms.
The goal is to find a reducing algorithm whose complexity is not dominated by the resulting reduced algorithm's.
For example, one selection algorithm for finding the median in an unsorted list involves first sorting the list (the expensive portion) and then pulling out the middle element in the sorted list (the cheap portion).
This technique is also known as transform and conquer.
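The median example amounts to only a few lines once sorting is available as a known, well-optimized building block. In the sketch below, the function name is hypothetical and the upper middle element is returned for lists of even length, which is an assumption of this example.

```python
def median_by_sorting(values):
    """Reduce median-finding to sorting: sort first, then pick the middle element."""
    if not values:
        raise ValueError("the list must be non-empty")
    ordered = sorted(values)           # the expensive portion
    return ordered[len(ordered) // 2]  # the cheap portion


print(median_by_sorting([7, 1, 5, 3, 9]))  # 5
```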
Search and enumeration.
Many problems (such as playing chess) can be modeled as problems on graphs.
A graph exploration algorithm specifies rules for moving around a graph and is useful for such problems.
This category also includes search algorithms, branch and bound enumeration and backtracking.
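Breadth-first search is one simple graph exploration rule: visit the start node, then its neighbours, then their neighbours, and so on. The sketch below assumes the graph is given as a dictionary mapping each node to a list of neighbours; the function name and the sample graph are illustrative.

```python
from collections import deque


def bfs_path(graph, start, goal):
    """Return a path with the fewest edges from start to goal, or None."""
    previous = {start: None}   # also serves as the set of visited nodes
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the recorded links backwards to rebuild the path.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in graph.get(node, []):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None


graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"], "E": []}
print(bfs_path(graph, "A", "E"))  # ['A', 'B', 'D', 'E']
print(bfs_path(graph, "E", "A"))  # None
```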
The probabilistic and heuristic paradigm.
Algorithms belonging to this class fit the definition of an algorithm more loosely.
Probabilistic algorithms are those that make some choices randomly (or pseudo-randomly); for some problems, it can in fact be proven that the fastest solutions must involve some randomness.
Genetic algorithms attempt to find solutions to problems by mimicking biological evolutionary processes, with a cycle of random mutations yielding successive generations of "solutions".
Thus, they emulate reproduction and "survival of the fittest".
In genetic programming, this approach is extended to algorithms, by regarding the algorithm itself as a "solution" to a problem.
Heuristic algorithms aim to find not an optimal solution but an approximate one, for use where time or resources are limited.
They are used when finding a perfect solution is not practical.
Examples include local search, tabu search, and simulated annealing algorithms, a class of heuristic probabilistic algorithms that vary the solution of a problem by a random amount.
The name "simulated annealing" alludes to the metallurgic term meaning the heating and cooling of metal to achieve freedom from defects.
The purpose of the random variance is to find close to globally optimal solutions rather than simply locally optimal ones, the idea being that the random element will be decreased as the algorithm settles down to a solution.
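A minimal sketch of simulated annealing is given below. It assumes a linear cooling schedule and the commonly used acceptance rule in which a worse candidate is accepted with probability exp(-delta / temperature); these choices, the parameter values and the toy cost function are illustrative assumptions rather than a prescribed formulation. The random generator is seeded so that a run can be repeated.

```python
import math
import random


def simulated_annealing(cost, start, neighbour, steps=2000, start_temp=10.0, seed=0):
    """Return the best solution seen while randomly perturbing the current one."""
    rng = random.Random(seed)
    current = best = start
    for step in range(steps):
        # The temperature, and with it the random element, decreases over time.
        temperature = start_temp * (1 - step / steps)
        candidate = neighbour(current, rng)
        delta = cost(candidate) - cost(current)
        # Always accept improvements; sometimes accept a worse candidate.
        if delta <= 0 or (temperature > 0 and rng.random() < math.exp(-delta / temperature)):
            current = candidate
        if cost(current) < cost(best):
            best = current
    return best


def toy_cost(x):
    """A toy cost function with its minimum at x = 3."""
    return (x - 3) ** 2


def step_neighbour(x, rng):
    """Vary the current solution by a small random amount."""
    return x + rng.choice((-1, 1))


result = simulated_annealing(toy_cost, 50, step_neighbour)
print(result, toy_cost(result))
```

Accepting occasional worse candidates early on is what allows the search to leave a locally optimal solution; as the temperature falls, the procedure behaves more and more like plain local search.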
Classification by field of study
Every field of science has its own problems and needs efficient algorithms.
Related problems in one field are often studied together.
Some example classes are search algorithms, sorting algorithms, merge algorithms, numerical algorithms, graph algorithms, string algorithms, computational geometric algorithms, combinatorial algorithms, machine learning, cryptography, data compression algorithms and parsing techniques.
Fields tend to overlap with each other, and algorithm advances in one field may improve those of other, sometimes completely unrelated, fields.
For example, dynamic programming was originally invented for optimization of resource consumption in industry, but is now used in solving a broad range of problems in many fields.
Classification by complexity
Algorithms can be classified by the amount of time they need to complete compared to their input size.
There is a wide variety: some algorithms complete in linear time relative to input size, some do so in an exponential amount of time or even worse, and some never halt.
Additionally, some problems may have multiple algorithms of differing complexity, while other problems might have no algorithms or no known efficient algorithms.
There are also mappings from some problems to other problems.
Owing to this, it was found to be more suitable to classify the problems themselves instead of the algorithms into equivalence classes based on the complexity of the best possible algorithms for them.
Classification by computing power
Another way to classify algorithms is by computing power.
This is typically done by considering some collection (class) of algorithms.
A recursive class of algorithms is one that includes algorithms for all Turing computable functions.
Looking at classes of algorithms allows for the possibility of restricting the available computational resources (time and memory) used in a computation.
A subrecursive class of algorithms is one in which not all Turing computable functions can be obtained.
For example, the algorithms that run in polynomial time suffice for many important types of computation but do not exhaust all Turing computable functions.
The class of algorithms implemented by primitive recursive functions is another subrecursive class.
Burgin (2005, p. 24) uses a generalized definition of algorithms that relaxes the common requirement that the output of the algorithm that computes a function must be determined after a finite number of steps.
He defines a super-recursive class of algorithms as "a class of algorithms in which it is possible to compute functions not computable by any Turing machine" (Burgin 2005, p. 107).
This is closely related to the study of methods of hypercomputation.
Legal issues
See also: Software patents for a general overview of the patentability of software, including computer-implemented algorithms.
Algorithms, by themselves, are not usually patentable.
In the United States, a claim consisting solely of simple manipulations of abstract concepts, numbers, or signals does not constitute "processes" (USPTO 2006) and hence algorithms are not patentable (as in Gottschalk v. Benson).
However, practical applications of algorithms are sometimes patentable.
For example, in Diamond v. Diehr, the application of a simple feedback algorithm to aid in the curing of synthetic rubber was deemed patentable.
The patenting of software is highly controversial, and there are highly criticized patents involving algorithms, especially data compression algorithms, such as Unisys' LZW patent.
Additionally, some cryptographic algorithms have export restrictions (see export of cryptography).
History: Development of the notion of "algorithm"
Origin of the word
The word algorithm comes from the name of the 9th century Persian mathematician Abu Abdullah Muhammad ibn Musa al-Khwarizmi whose works introduced Indian numerals and algebraic concepts.
He worked in Baghdad at the time when it was the centre of scientific studies and trade.
The word algorism originally referred only to the rules of performing arithmetic using Arabic numerals but evolved via European Latin translation of al-Khwarizmi's name into algorithm by the 18th century.
The word evolved to include all definite procedures for solving problems or performing tasks.
Discrete and distinguishable symbols
Tally-marks: To keep track of their flocks, their sacks of grain and their money the ancients used tallying: accumulating stones or marks scratched on sticks, or making discrete symbols in clay.
Through the Babylonian and Egyptian use of marks and symbols, eventually Roman numerals and the abacus evolved (Dilson, p.16–41).
Tally marks appear prominently in unary numeral system arithmetic used in Turing machine and Post-Turing machine computations.
Manipulation of symbols as "place holders" for numbers: algebra
The work of the Ancient Greek geometers, Persian mathematician Al-Khwarizmi (often considered as the "father of algebra"), and Western European mathematicians culminated in Leibniz's notion of the calculus ratiocinator (ca 1680):
"A good century and a half ahead of his time, Leibniz proposed an algebra of logic, an algebra that would specify the rules for manipulating logical concepts in the manner that ordinary algebra specifies the rules for manipulating numbers" (Davis 2000:1)
Mechanical contrivances with discrete states
The clock: Bolter credits the invention of the weight-driven clock as "The key invention [of Europe in the Middle Ages]", in particular the verge escapement (Bolter 1984:24) that provides us with the tick and tock of a mechanical clock.
"The accurate automatic machine" (Bolter 1984:26) led immediately to "mechanical automata" beginning in the thirteenth century and finally to "computational machines": the difference engine and analytical engines of Charles Babbage and Countess Ada Lovelace (Bolter p.33–34, p.204–206).
Jacquard loom, Hollerith punch cards, telegraphy and telephony — the electromechanical relay: Bell and Newell (1971) indicate that the Jacquard loom (1801), precursor to Hollerith cards (punch cards, 1887), and “telephone switching technologies” were the roots of a tree leading to the development of the first computers (Bell and Newell diagram p. 39, cf Davis 2000).
By the mid-1800s the telegraph, the precursor of the telephone, was in use throughout the world, its discrete and distinguishable encoding of letters as “dots and dashes” a common sound.
By the late 1800s the ticker tape (ca 1870s) was in use, as was the use of Hollerith cards in the 1890 U.S. census.
Then came the Teletype (ca 1910) with its punched-paper use of Baudot code on tape.
Telephone-switching networks of electromechanical relays (invented 1835) were behind the work of George Stibitz (1937), the inventor of the digital adding device.
As he worked in Bell Laboratories, he observed the "burdensome" use of mechanical calculators with gears.
"He went home one evening in 1937 intending to test his idea....
When the tinkering was over, Stibitz had constructed a binary adding device".
(Valley News, p. 13).
Davis (2000) observes the particular importance of the electromechanical relay (with its two "binary states" open and closed):
"It was only with the development, beginning in the 1930s, of electromechanical calculators using electrical relays, that machines were built having the scope Babbage had envisioned."
(Davis, p. 14).
Mathematics during the 1800s up to the mid-1900s
Symbols and rules: In rapid succession the mathematics of George Boole (1847, 1854), Gottlob Frege (1879), and Giuseppe Peano (1888–1889) reduced arithmetic to a sequence of symbols manipulated by rules.
Peano's The principles of arithmetic, presented by a new method (1888) was "the first attempt at an axiomatization of mathematics in a symbolic language" (van Heijenoort:81ff).
But van Heijenoort gives Frege (1879) this kudos: Frege’s is "perhaps the most important single work ever written in logic. ... in which we see a 'formula language', that is a lingua characterica, a language written with special symbols, 'for pure thought', that is, free from rhetorical embellishments ... constructed from specific symbols that are manipulated according to definite rules" (van Heijenoort:1).
The work of Frege was further simplified and amplified by Alfred North Whitehead and Bertrand Russell in their Principia Mathematica (1910–1913).
The paradoxes: At the same time a number of disturbing paradoxes appeared in the literature, in particular the Burali-Forti paradox (1897), the Russell paradox (1902–03), and the Richard Paradox (Dixon 1906, cf Kleene 1952:36–40).
The resultant considerations led to Kurt Gödel’s paper (1931) — he specifically cites the paradox of the liar — that completely reduces rules of recursion to numbers.
Effective calculability: In an effort to solve the Entscheidungsproblem defined precisely by Hilbert in 1928, mathematicians first set about to define what was meant by an "effective method" or "effective calculation" or "effective calculability" (i.e., a calculation that would succeed).
In rapid succession the following appeared: Alonzo Church, Stephen Kleene and J.B. Rosser's λ-calculus (cf footnote in Alonzo Church 1936a:90, 1936b:110); a finely-honed definition of "general recursion" from the work of Gödel acting on suggestions of Jacques Herbrand (cf Gödel's Princeton lectures of 1934) and subsequent simplifications by Kleene (1935-6:237ff, 1943:255ff); Church's proof (1936:88ff) that the Entscheidungsproblem was unsolvable; Emil Post's definition of effective calculability as a worker mindlessly following a list of instructions to move left or right through a sequence of rooms and while there either mark or erase a paper or observe the paper and make a yes-no decision about the next instruction (cf "Formulation I", Post 1936:289-290);
Alan Turing's proof that the Entscheidungsproblem was unsolvable by use of his "a- [automatic-] machine" (Turing 1936-7:116ff), in effect almost identical to Post's "formulation"; J. Barkley Rosser's definition of "effective method" in terms of "a machine" (Rosser 1939:226);
S. C. Kleene's proposal of a precursor to "Church's Thesis" that he called "Thesis I" (Kleene 1943:273–274); and a few years later Kleene's renaming his Thesis "Church's Thesis" (Kleene 1952:300, 317) and proposing "Turing's Thesis" (Kleene 1952:376).
Emil Post (1936) and Alan Turing (1936-7, 1939)
Here is a remarkable coincidence of two men not knowing each other but describing a process of men-as-computers working on computations — and they yield virtually identical definitions.
Emil Post (1936) described the actions of a "computer" (human being) as follows:
"...two concepts are involved: that of a symbol space in which the work leading from problem to answer is to be carried out, and a fixed unalterable set of directions."
His symbol space would be
"a two way infinite sequence of spaces or boxes...
The problem solver or worker is to move and work in this symbol space, being capable of being in, and operating in but one box at a time.... a box is to admit of but two possible conditions, i.e., being empty or unmarked, and having a single mark in it, say a vertical stroke.
"One box is to be singled out and called the starting point. ...a specific problem is to be given in symbolic form by a finite number of boxes [i.e., INPUT] being marked with a stroke.
Likewise the answer [i.e., OUTPUT] is to be given in symbolic form by such a configuration of marked boxes....
"A set of directions applicable to a general problem sets up a deterministic process when applied to each specific problem.
This process will terminate only when it comes to the direction of type (C ) [i.e., STOP]." (U p. 289–290)
See more at Post-Turing machine
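The worker and the set of directions that Post describes can be sketched as a short program. The following Python sketch is illustrative only: the tuple encoding of directions, the names run_post_worker and APPEND_STROKE, and the step limit are assumptions made for this example, not Post's notation. Boxes are indexed by integers, the set of marked boxes stands for the symbol space, and each direction either performs one operation and names the next direction, tests the current box and branches, or stops.

```python
def run_post_worker(directions, marked, start=0, max_steps=10_000):
    """Follow numbered directions over a two-way infinite row of boxes.

    directions: dict mapping a direction number to one of
        ("mark", nxt), ("erase", nxt), ("right", nxt), ("left", nxt),
        ("test", nxt_if_marked, nxt_if_empty), ("stop",)
    marked: iterable of box indices that initially hold a stroke (the input).
    Returns the set of marked boxes when a stop direction is reached (the output).
    """
    boxes = set(marked)      # only marked boxes are stored; all others are empty
    position = start         # the box the worker is currently in
    current = 1              # the worker begins with direction 1
    for _ in range(max_steps):
        direction = directions[current]
        kind = direction[0]
        if kind == "stop":
            return boxes
        if kind == "mark":
            boxes.add(position)
            current = direction[1]
        elif kind == "erase":
            boxes.discard(position)
            current = direction[1]
        elif kind == "right":
            position += 1
            current = direction[1]
        elif kind == "left":
            position -= 1
            current = direction[1]
        elif kind == "test":
            # Yes-no decision: is the current box marked?
            current = direction[1] if position in boxes else direction[2]
        else:
            raise ValueError(f"unknown direction kind: {kind!r}")
    raise RuntimeError("step limit reached without a stop direction")


# Hypothetical example: walk right past a run of strokes, then add one more.
APPEND_STROKE = {
    1: ("test", 2, 3),   # marked -> keep walking, empty -> go and mark it
    2: ("right", 1),
    3: ("mark", 4),
    4: ("stop",),
}

result = run_post_worker(APPEND_STROKE, {0, 1, 2})
```

With the sample directions, the input boxes {0, 1, 2} yield the output boxes {0, 1, 2, 3}. The process is deterministic, and it terminates only because a stop direction is eventually reached; the step limit is a practical guard added for the sketch.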
Alan Turing’s work (1936, 1939:160) preceded that of Stibitz (1937); it is unknown whether Stibitz knew of the work of Turing.
Turing’s biographer believed that Turing’s use of a typewriter-like model derived from a youthful interest: “Alan had dreamt of inventing typewriters as a boy; Mrs. Turing had a typewriter; and he could well have begun by asking himself what was meant by calling a typewriter 'mechanical'" (Hodges, p. 96).
Given the prevalence of Morse code and telegraphy, ticker tape machines, and Teletypes we might conjecture that all were influences.
Turing — his model of computation is now called a Turing machine — begins, as did Post, with an analysis of a human computer that he whittles down to a simple set of basic motions and "states of mind".
But he continues a step further and creates a machine as a model of computation of numbers (Turing 1936-7:116).
"Computing is normally done by writing certain symbols on paper.
We may suppose this paper is divided into squares like a child's arithmetic book....I assume then that the computation is carried out on one-dimensional paper, i.e., on a tape divided into squares.
I shall also suppose that the number of symbols which may be printed is finite....
"The behavior of the computer at any moment is determined by the symbols which he is observing, and his "state of mind" at that moment.
We may suppose that there is a bound B to the number of symbols or squares which the computer can observe at one moment.
If he wishes to observe more, he must use successive observations.
We will also suppose that the number of states of mind which need be taken into account is finite...
"Let us imagine the operations performed by the computer to be split up into 'simple operations' which are so elementary that it is not easy to imagine them further divided" (Turing 1936-7:136).
Turing's reduction yields the following:
"The simple operations must therefore include:
"(a) Changes of the symbol on one of the observed squares
"(b) Changes of one of the squares observed to another square within L squares of one of the previously observed squares.
"It may be that some of these changes necessarily involve a change of state of mind.
The most general single operation must therefore be taken to be one of the following:
"(A) A possible change (a) of symbol together with a possible change of state of mind.
"(B) A possible change (b) of observed squares, together with a possible change of state of mind"
"We may now construct a machine to do the work of this computer."
(Turing 1936-7:136)
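Turing's reduction to a finite table indexed by "state of mind" and observed symbol can be sketched in Python. This is a minimal sketch under stated assumptions: the machine observes a single square at a time, and the names run_turing_machine and SUCCESSOR, the "halt" state, and the step limit are hypothetical conveniences that do not come from Turing's paper.

```python
from collections import defaultdict


def run_turing_machine(table, tape, state="start", blank=" ", max_steps=10_000):
    """Run a one-tape machine until it enters the state "halt".

    table maps (state, observed symbol) to (symbol to write, move, next state),
    where move is "L", "R" or "N" (stay). Returns the tape contents as a
    string with surrounding blanks removed.
    """
    cells = defaultdict(lambda: blank, enumerate(tape))  # unbounded tape
    head = 0
    steps = 0
    while state != "halt":
        if steps >= max_steps:
            raise RuntimeError("step limit reached without halting")
        symbol = cells[head]                              # observe one square
        write, move, state = table[(state, symbol)]      # consult the table
        cells[head] = write                               # change (a): the symbol
        head += {"L": -1, "R": 1, "N": 0}[move]           # change (b): the square
        steps += 1
    if not cells:
        return ""
    lo, hi = min(cells), max(cells)
    return "".join(cells[i] for i in range(lo, hi + 1)).strip(blank)


# Hypothetical example: append one stroke to a number written in unary.
SUCCESSOR = {
    ("start", "1"): ("1", "R", "start"),  # skip over existing strokes
    ("start", " "): ("1", "N", "halt"),   # write one more stroke and stop
}

result = run_turing_machine(SUCCESSOR, "111")
```

Each table entry combines a possible change of symbol, a possible change of observed square, and a possible change of state, which mirrors operations (A) and (B) in the quotation. The example table is not one of Turing's own machines; applied to the tape "111" it leaves "1111".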
A few years later, Turing expanded his analysis (thesis, definition) with this forceful expression of it:
"A function is said to be "effectively calculable" if its values can be found by some purely mechanical process.
Although it is fairly easy to get an intuitive grasp of this idea, it is nevertheless desirable to have some more definite, mathematically expressible definition . . . [he discusses the history of the definition pretty much as presented above with respect to Gödel, Herbrand, Kleene, Church, Turing and Post] . . .
We may take this statement literally, understanding by a purely mechanical process one which could be carried out by a machine.
It is possible to give a mathematical description, in a certain normal form, of the structures of these machines.
The development of these ideas leads to the author's definition of a computable function, and to an identification of computability † with effective calculability . . . .
"† We shall use the expression "computable function" to mean a function calculable by a machine, and we let "effectively calculable" refer to the intuitive idea without particular identification with any one of these definitions." (Turing 1939:160)
J. B. Rosser (1939) and S. C. Kleene (1943)
J. Barkley Rosser boldly defined an ‘effective [mathematical] method’ in the following manner (boldface added):
"'Effective method' is used here in the rather special sense of a method each step of which is precisely determined and which is certain to produce the answer in a finite number of steps.
With this special meaning, three different precise definitions have been given to date. [his footnote #5; see discussion immediately below].
The simplest of these to state (due to Post and Turing) says essentially that an effective method of solving certain sets of problems exists if one can build a machine which will then solve any problem of the set with no human intervention beyond inserting the question and (later) reading the answer.
All three definitions are equivalent, so it doesn't matter which one is used.
Moreover, the fact that all three are equivalent is a very strong argument for the correctness of any one."
(Rosser 1939:225–6)
Rosser's footnote #5 references the work of (1) Church and Kleene and their definition of λ-definability, in particular Church's use of it in his An Unsolvable Problem of Elementary Number Theory (1936); (2) Herbrand and Gödel and their use of recursion in particular Gödel's use in his famous paper On Formally Undecidable Propositions of Principia Mathematica and Related Systems I (1931); and (3) Post (1936) and Turing (1936-7) in their mechanism-models of computation.
Stephen C. Kleene stated, as his now-famous "Thesis I", what is now known as the Church-Turing thesis.
But he did this in the following context (boldface in original):
"12. Algorithmic theories...
In setting up a complete algorithmic theory, what we do is to describe a procedure, performable for each set of values of the independent variables, which procedure necessarily terminates and in such manner that from the outcome we can read a definite answer, "yes" or "no," to the question, "is the predicate value true?""
(Kleene 1943:273)
History after 1950
A number of efforts have been directed toward further refinement of the definition of "algorithm", and activity is on-going because of issues surrounding, in particular, foundations of mathematics (especially the Church-Turing Thesis) and philosophy of mind (especially arguments around artificial intelligence).
For more, see Algorithm characterizations.
Algorithmic Repositories
LEDA
Stanford GraphBase
Combinatorica
Netlib
XTango
Ambiguity
Ambiguity is the property of being ambiguous, where a word, term, notation, sign, symbol, phrase, sentence, or any other form used for communication, is called ambiguous if it can be interpreted in more than one way.
Ambiguity is distinct from vagueness, which arises when the boundaries of meaning are indistinct.
Ambiguity is context-dependent: the same communication may be ambiguous in one context and unambiguous in another context.
For a word, ambiguity typically refers to an unclear choice between different definitions as may be found in a dictionary.
A sentence may be ambiguous due to different ways of parsing the same sequence of words.
Linguistic forms
Lexical ambiguity arises when context is insufficient to determine the sense of a single word that has more than one meaning.
For example, the word “bank” has several distinct definitions, including “financial institution” and “edge of a river,” but if someone says “I deposited $100 in the bank,” most people would not think the speaker used a shovel to dig in the mud.
The word "run" has 130 ambiguous definitions in some lexicons.
"Biweekly" can mean "fortnightly" (once every two weeks, or 26 times a year) or "twice a week" (104 times a year).
Stating a specific context like "meeting schedule" does not disambiguate "biweekly."
Many people believe that such lexically-ambiguous, miscommunication-prone words should be avoided altogether, since the user generally has to waste time, effort, and attention span to define what is meant when they are used.
The use of multi-defined words requires the author or speaker to clarify their context, and sometimes elaborate on their specific intended meaning (in which case, a less ambiguous term should have been used).
The goal of clear concise communication is that the receiver(s) have no misunderstanding about what was meant to be conveyed.
An exception to this could include a politician whose "wiggle words" and obfuscation are necessary to gain support from multiple constituents with mutually exclusive, conflicting desires for their candidate of choice.
Ambiguity is a powerful tool of political science.
More problematic are words whose senses express closely-related concepts.
“Good,” for example, can mean “useful” or “functional” (That’s a good hammer), “exemplary” (She’s a good student), “pleasing” (This is good soup), “moral” (a good person versus the lesson to be learned from a story), "righteous", etc.
“I have a good daughter” is not clear about which sense is intended.
The various ways to apply prefixes and suffixes can also create ambiguity (“unlockable” can mean “capable of being unlocked” or “impossible to lock”, and therefore should not be used).
Syntactic ambiguity arises when a sentence can be parsed in more than one way.
“He ate the cookies on the couch,” for example, could mean that he ate those cookies which were on the couch (as opposed to those that were on the table), or it could mean that he was sitting on the couch when he ate the cookies.
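The two readings correspond to two distinct parse trees, which can be counted mechanically. The sketch below assumes a toy grammar written for this one sentence (the rule set and the names LEXICON, BINARY_RULES and count_parses are illustrative, not a standard linguistic resource) and uses CYK-style chart parsing to count the trees.

```python
from collections import defaultdict

# Toy lexicon: each word maps to the categories it can belong to.
LEXICON = {
    "he": {"NP"},
    "ate": {"V"},
    "the": {"Det"},
    "cookies": {"N"},
    "couch": {"N"},
    "on": {"P"},
}

# Binary rules (parent, left child, right child). The prepositional phrase
# may attach either to a noun phrase or to a verb phrase; that choice is
# the source of the ambiguity.
BINARY_RULES = [
    ("S", "NP", "VP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),   # "ate ... on the couch": the eating happened there
    ("NP", "Det", "N"),
    ("NP", "NP", "PP"),   # "the cookies on the couch": those particular cookies
    ("PP", "P", "NP"),
]


def count_parses(words, start="S"):
    """Count the distinct parse trees of `words` rooted in `start`."""
    n = len(words)
    # chart[i][j][X] = number of trees deriving words[i:j] from category X
    chart = [[defaultdict(int) for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        for category in LEXICON.get(word, ()):
            chart[i][i + 1][category] += 1
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):          # split point between the children
                for parent, left, right in BINARY_RULES:
                    left_count = chart[i][k].get(left, 0)
                    right_count = chart[k][j].get(right, 0)
                    if left_count and right_count:
                        chart[i][j][parent] += left_count * right_count
    return chart[0][n].get(start, 0)


ambiguous = count_parses("he ate the cookies on the couch".split())
unambiguous = count_parses("he ate the cookies".split())
```

Under this toy grammar the full sentence has two parses and the shorter sentence has one, so the ambiguity comes entirely from where the phrase "on the couch" attaches.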
Spoken language can contain many more types of ambiguities, where there is more than one way to compose a set of sounds into words, for example “ice cream” and “I scream.”
Such ambiguity is generally resolved based on the context.
A mishearing of such, based on incorrectly-resolved ambiguity, is called a mondegreen.
Semantic ambiguity arises when a word or concept has an inherently diffuse meaning based on widespread or informal usage.
This is often the case, for example, with idiomatic expressions whose definitions are rarely or never well-defined, and are presented in the context of a larger argument that invites a conclusion.
For example, “You could do with a new automobile.
How about a test drive?”
The clause “You could do with” presents a statement with such wide possible interpretation as to be essentially meaningless.
Lexical ambiguity is contrasted with semantic ambiguity.
The former represents a choice between a finite number of known and meaningful context-dependent interpretations.
The latter represents a choice between any number of possible interpretations, none of which may have a standard agreed-upon meaning.
This form of ambiguity is closely related to vagueness.
Linguistic ambiguity can be a problem in law (see Ambiguity (law)), because the interpretation of written documents and oral agreements is often of paramount importance.
Intentional application
Philosophers (and other users of logic) spend a lot of time and effort searching for and removing (or intentionally adding) ambiguity in arguments, because it can lead to incorrect conclusions and can be used to deliberately conceal bad arguments.
For example, a politician might say “I oppose taxes that hinder economic growth.”
Some will think he opposes taxes in general, because they hinder economic growth.
Others may think he opposes only those taxes that he believes will hinder economic growth.
In writing, the correct insertion or omission of a comma after “taxes” and the use of "which" can help reduce ambiguity here (for the first meaning, “, which” is properly used in place of “that”), or the sentence can be restructured to eliminate possible misinterpretation.
The devious politician hopes that each constituent will interpret the above statement in the most desirable way, and think the politician supports everyone's opinion.
However, the opposite can also be true: an opponent can turn a positive statement into a bad one, if the speaker uses ambiguity (intentionally or not).
The logical fallacies of amphiboly and equivocation rely heavily on the use of ambiguous words and phrases.
In literature and rhetoric, on the other hand, ambiguity can be a useful tool.
Groucho Marx’s classic joke depends on a grammatical ambiguity for its humor, for example: “Last night I shot an elephant in my pajamas.
What he was doing in my pajamas I’ll never know.”
Ambiguity can also be used as a comic device through a genuine intention to confuse, as does Magic: The Gathering's Unhinged © Ambiguity, which makes puns with homophones, mispunctuation, and run-ons: “Whenever a player plays a spell that counters a spell that has been played[,] or a player plays a spell that comes into play with counters, that player may counter the next spell played[,] or put an additional counter on a permanent that has already been played, but not countered.”
Songs and poetry often rely on ambiguous words for artistic effect, as in the song title “Don’t It Make My Brown Eyes Blue” (where “blue” can refer to the color, or to sadness).
In narrative, ambiguity can be introduced in several ways: motive, plot, character.
F. Scott Fitzgerald uses the last type of ambiguity, that of character, with notable effect in his novel The Great Gatsby.
All religions debate the orthodoxy or heterodoxy of ambiguity.
Christianity and Judaism employ the concept of paradox synonymously with 'ambiguity'.
Ambiguity within Christianity (and other religions) is resisted by the conservatives and fundamentalists, who regard the concept as equating with 'contradiction'.
Non-fundamentalist Christians and Jews endorse Rudolf Otto's description of the sacred as 'mysterium tremendum et fascinans', the awe-inspiring mystery which fascinates humans.
Metonymy involves the use of the name of a subcomponent part as an abbreviation, or jargon, for the name of the whole object (for example "wheels" to refer to a car, or "flowers" to refer to beautiful offspring, an entire plant, or a collection of blooming plants).
In modern vocabulary critical semiotics, metonymy encompasses any potentially-ambiguous word substitution that is based on contextual contiguity (located close together), or a function or process that an object performs, such as "sweet ride" to refer to a nice car.
Metonym miscommunication is considered a primary mechanism of linguistic humour.
Psychology and management
In sociology and social psychology, the term "ambiguity" is used to indicate situations that involve uncertainty.
An increasing amount of research is concentrating on how people react and respond to ambiguous situations.
Much of this focuses on ambiguity tolerance.
A number of correlations have been found between an individual’s reaction and tolerance to ambiguity and a range of factors.
Apter and Desselles (2001), for example, found strong correlations with attributes and factors such as a greater preference for safe as opposed to risk-based sports, a preference for endurance-type activities as opposed to explosive activities, a more organized and less casual lifestyle, greater care and precision in descriptions, a lower sensitivity to emotional and unpleasant words, a less acute sense of humour, engaging in a smaller variety of sexual practices than their more risk-comfortable colleagues, a lower likelihood of using drugs, pornography and drink, and a greater likelihood of displaying obsessional behaviour.
In the field of leadership, David Wilkinson (2006) found strong correlations between an individual leader's reaction to ambiguous situations and the Modes of Leadership they use, the type of creativity (Kirton 2003) and how they relate to others.
Music
In music, pieces or sections which confound expectations and may be or are interpreted simultaneously in different ways are ambiguous, such as some polytonality, polymeter, other ambiguous meters or rhythms, and ambiguous phrasing, or (Stein 2005, p.79) any aspect of music.
The music of Africa is often purposely ambiguous.
To quote Sir Donald Francis Tovey (1935, p.195), “Theorists are apt to vex themselves with vain efforts to remove uncertainty just where it has a high aesthetic value.”
Constructed language
Some languages have been created with the intention of avoiding ambiguity, especially lexical ambiguity.
Lojban and Loglan are two related languages which have been created with this in mind.
The languages can be both spoken and written.
These languages are intended to provide a greater technical precision over natural languages, although historically, such attempts at language improvement have been criticized.
Languages composed from many diverse sources contain much ambiguity and inconsistency.
The many exceptions to syntax and semantic rules are time-consuming and difficult to learn.
Mathematics and physics
Mathematical notation, widely used in physics and other sciences, avoids many ambiguities compared to expression in natural language.
However, for various reasons, several lexical, syntactic and semantic ambiguities remain.
Names of functions
The ambiguity in the style of writing a function should not be confused with a multivalued function, which can (and should) be defined in a deterministic and unambiguous way.
Several special functions still do not have established notations.
Usually, conversion to another notation requires scaling the argument and/or the resulting value; sometimes the same function name is used for different conventions, causing confusion.
Examples of such underestablished functions:
Sinc function
Elliptic integral of the third kind; translating an elliptic integral from Maple to Mathematica, one should replace the second argument by its square (see Talk:Elliptic integral#List_of_notations); dealing with complex values, this may cause problems.
Exponential integral, page 228 http://www.math.sfu.ca/~cbm/aands/page_228.htm
Hermite polynomial, page 775 http://www.math.sfu.ca/~cbm/aands/page_775.htm
Expressions
Ambiguous expressions often appear in physical and mathematical texts.
It is common practice to omit multiplication signs in mathematical expressions.
Also, it is common to give the same name to a variable and a function, for example, $f = f(x)$.
Then, if one sees $f = f(y+1)$, there is no way to tell whether it means $f = f(x)$ multiplied by $(y+1)$, or the function $f$ evaluated at an argument equal to $(y+1)$.
In each case of use of such notations, the reader is supposed to be able to perform the deduction and reveal the true meaning.
Creators of algorithmic languages try to avoid ambiguities.
Many algorithmic languages (C++, MATLAB, Fortran, Maple) require the character * as symbol of multiplication.
The language Mathematica allows the user to omit the multiplication symbol, but requires square brackets to indicate the argument of a function; square brackets are not allowed for grouping of expressions.
Fortran, in addition, does not allow use of the same name (identifier) for different objects, for example, function and variable; in particular, the expression f=f(x) is qualified as an error.
The order of operations may depend on the context.
In most programming languages, the operations of division and multiplication have equal priority and are executed from left to right.
Until the last century, many editorials assumed that multiplication is performed first, for example, $a/bc$ is interpreted as $a/(bc)$; in this case, the insertion of parentheses is required when translating the formulas to an algorithmic language.
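The difference between the two conventions can be checked directly in a programming language. The following Python sketch uses arbitrary sample values (the variable names are illustrative) to show that a division followed by a multiplication is evaluated from left to right, so the "multiplication first" reading has to be written with explicit parentheses.

```python
a, b, c = 8.0, 2.0, 4.0

# Division and multiplication have equal priority and associate left to right,
# so this is evaluated as (a / b) * c.
left_to_right = a / b * c

# The older typographic reading, with the product taken first, needs parentheses.
product_first = a / (b * c)

print(left_to_right)   # 16.0
print(product_first)   # 1.0
```

The two results differ by a factor of c squared, which is why a formula copied from print without added parentheses can silently compute the wrong quantity.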
In addition, it is common to write the argument of a function without parentheses, which also may lead to ambiguity.
Sometimes, italic letters are used to denote elementary functions.
In the scientific journal style, the expression means product of variables , , and , although in a slideshow, it may mean .
Commas in subscripts and superscripts are sometimes omitted, which is also ambiguous notation.
If it is written , the reader should guess from the context, does it mean a single-index object, evaluated while the subscript is equal to product of variables , and , or it is indication to a three-valent tensor.
The writing of instead of may mean that the writer either is stretched in space (for example, to reduce the publication fees, or aims to increase number of publications without considering readers.
The same may apply to any other use of ambiguous notations.
Examples of potentially confusing ambiguous mathematical expressions
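$\sin^2 \alpha/2$, which could be understood to mean either $(\sin(\alpha/2))^2$ or $(\sin \alpha)^2/2$.
$\sin^{-1} \alpha$, which by convention means $\arcsin(\alpha)$, though it might be thought to mean $(\sin \alpha)^{-1}$ since $\sin^{n} \alpha$ means $(\sin \alpha)^{n}$.
$a/2b$, which arguably should mean $(a/2)b$ but would commonly be understood to mean $a/(2b)$.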
Notations in quantum optics and quantum mechanics
It is common to define the coherent states in quantum optics with $|\alpha\rangle$ and states with a fixed number of photons with $|n\rangle$.
Then, there is an "unwritten rule": the state is coherent if there are more Greek characters than Latin characters in the argument, and a photon state if the Latin characters dominate.
The ambiguity becomes even worse if $|x\rangle$ is used for the states with a certain value of the coordinate, and $|p\rangle$ means the state with a certain value of the momentum, which may be used in books on quantum mechanics.
Such ambiguities easily lead to confusion, especially if some normalized dimensionless variables are used.
The expression $|1\rangle$ may mean a state with a single photon, or the coherent state with mean amplitude equal to 1, or a state with momentum equal to unity, and so on.
The reader is supposed to guess from the context.
Examples of ambiguous terms in physics
Some physical quantities do not yet have established notations; their value (and sometimes even dimension, as in the case of the Einstein coefficients) depends on the system of notations.
A highly confusing term is gain.
For example, the sentence "the gain of a system should be doubled", without context, means close to nothing.
It may mean that the ratio of the output voltage of an electric circuit to the input voltage should be doubled.
It may mean that the ratio of the output power of an electric or optical circuit to the input power should be doubled.
It may mean that the gain of the laser medium should be doubled, for example, doubling the population of the upper laser level in a quasi-two level system (assuming negligible absorption of the ground-state).
Confusion may also arise from the use of atomic percent as a measure of the concentration of a dopant, or of the resolution of an imaging system as a measure of the size of the smallest detail which can still be resolved against the background of statistical noise.
See also Accuracy and precision.
Many terms are ambiguous.
Each use of an ambiguous term should be preceded by the definition, suitable for a specific case.
The Berry paradox arises as a result of systematic ambiguity.
In various formulations of the Berry paradox, such as one that reads "The number not nameable in less than eleven syllables", the term "nameable" is one that has this systematic ambiguity.
Terms of this kind give rise to vicious circle fallacies.
Other terms with this type of ambiguity are: satisfiable, definable, true, false, function, property, class, relation, cardinal, and ordinal.
Pedagogic use of ambiguous expressions
Ambiguity can be used as a pedagogical trick, to force students to reproduce the deduction by themselves.
Some textbooks give the same name to the function and to its Fourier transform:
.
Rigorously speaking, such an expression requires that ; even if function is a self-Fourier function, the expression should be written as ; however, it is assumed that the shape of the function (and even its norm ) depend on the character used to denote its argument.
If the Greek letter is used, it is assumed to be a Fourier transform of another function, The first function is assumed, if the expression in the argument contains more characters or , than characters , and the second function is assumed in the opposite case.
Expressions like or contain symbols and in equal amounts; they are ambiguous and should be avoided in serious deduction.
Artificial intelligence
Artificial intelligence (AI) is both the intelligence of machines and the branch of computer science which aims to create it.
Major AI textbooks define artificial intelligence as "the study and design of intelligent agents," where an intelligent agent is a system that perceives its environment and takes actions which maximize its chances of success.
John McCarthy, who coined the term in 1956, defines it as "the science and engineering of making intelligent machines."
Among the traits that researchers hope machines will exhibit are reasoning, knowledge, planning, learning, communication, perception and the ability to move and manipulate objects.
General intelligence (or "strong AI") has not yet been achieved and is a long-term goal of some AI research.
AI research uses tools and insights from many fields, including computer science, psychology, philosophy, neuroscience, cognitive science, linguistics, ontology, operations research, economics, control theory, probability, optimization and logic.
AI research also overlaps with tasks such as robotics, control systems, scheduling, data mining, logistics, speech recognition, facial recognition and many others.
Other names for the field have been proposed, such as computational intelligence, synthetic intelligence, intelligent systems, or computational rationality.
Perspectives on AI
AI in myth, fiction and speculation
Humanity has imagined in great detail the implications of thinking machines or artificial beings.
They appear in Greek myths, such as Talos of Crete, the golden robots of Hephaestus and Pygmalion's Galatea.
The earliest known humanoid robots (or automatons) were sacred statues worshipped in Egypt and Greece, believed to have been endowed with genuine consciousness by craftsmen.
In the sixteenth century, the alchemist Paracelsus claimed to have created artificial beings.
Realistic clockwork imitations of human beings have been built by people such as Yan Shi, Hero of Alexandria, Al-Jazari and Wolfgang von Kempelen.
In modern fiction, beginning with Mary Shelley's classic Frankenstein, writers have explored the ethical issues presented by thinking machines.
If a machine can be created that has intelligence, can it also feel?
If it can feel, does it have the same rights as a human being?
This is a key issue in Frankenstein as well as in modern science fiction: for example, the film A.I. Artificial Intelligence considers a machine in the form of a small boy which has been given the ability to feel human emotions, including, tragically, the capacity to suffer.
This issue is also being considered by futurists, such as California's Institute for the Future under the name "robot rights", although many critics believe that the discussion is premature.
Science fiction writers and futurists have also speculated on the technology's potential impact on humanity.
In fiction, AI has appeared as a servant (R2D2 in Star Wars), a comrade (Lt. Commander Data in Star Trek), an extension to human abilities (Ghost in the Shell), a conqueror (The Matrix), a dictator (With Folded Hands) and an exterminator (Terminator, Battlestar Galactica).
Some realistic potential consequences of AI are decreased human labor demand, the enhancement of human ability or experience, and a need for redefinition of human identity and basic values.
Futurists estimate the capabilities of machines using Moore's Law, which measures the relentless exponential improvement in digital technology with uncanny accuracy.
Ray Kurzweil has calculated that desktop computers will have the same processing power as human brains by the year 2029, and that by 2045 artificial intelligence will reach a point where it is able to improve itself at a rate that far exceeds anything conceivable in the past, a scenario that science fiction writer Vernor Vinge named the "technological singularity".
"Artificial intelligence is the next stage in evolution," Edward Fredkin said in the 1980s, expressing an idea first proposed by Samuel Butler's Darwin Among the Machines (1863), and expanded upon by George Dyson in his book of the same name (1998).
Several futurists and science fiction writers have predicted that human beings and machines will merge in the future into cyborgs that are more capable and powerful than either.
This idea, called transhumanism, has roots in Aldous Huxley and Robert Ettinger and is now associated with robot designer Hans Moravec, cyberneticist Kevin Warwick and Ray Kurzweil.
Transhumanism has been illustrated in fiction as well, for example in the manga Ghost in the Shell.
History of AI research
In the middle of the 20th century, a handful of scientists began a new approach to building intelligent machines, based on recent discoveries in neurology, a new mathematical theory of information, an understanding of control and stability called cybernetics, and above all, by the invention of the digital computer, a machine based on the abstract essence of mathematical reasoning.
The field of modern AI research was founded at a conference on the campus of Dartmouth College in the summer of 1956.
Those who attended would become the leaders of AI research for many decades, especially John McCarthy, Marvin Minsky, Allen Newell and Herbert Simon, who founded AI laboratories at MIT, CMU and Stanford.
They and their students wrote programs that were, to most people, simply astonishing: computers were solving word problems in algebra, proving logical theorems and speaking English.
By the middle 60s their research was heavily funded by the U.S. Department of Defense and they were optimistic about the future of the new field:
1965, H. A. Simon: "[M]achines will be capable, within twenty years, of doing any work a man can do"
1967, Marvin Minsky: "Within a generation ... the problem of creating 'artificial intelligence' will substantially be solved."
These predictions, and many like them, would not come true.
They had failed to recognize the difficulty of some of the problems they faced.
In 1974, in response to the criticism of England's Sir James Lighthill and ongoing pressure from Congress to fund more productive projects, the U.S. and British governments cut off all undirected, exploratory research in AI.
This was the first AI Winter.
In the early 80s, AI research was revived by the commercial success of expert systems (a form of AI program that simulated the knowledge and analytical skills of one or more human experts) and by 1985 the market for AI had reached more than a billion dollars.
Minsky and others warned the community that enthusiasm for AI had spiraled out of control and that disappointment was sure to follow.
Beginning with the collapse of the Lisp Machine market in 1987, AI once again fell into disrepute, and a second, more lasting AI Winter began.
In the 90s and early 21st century AI achieved its greatest successes, albeit somewhat behind the scenes.
Artificial intelligence was adopted throughout the technology industry, providing the heavy lifting for logistics, data mining, medical diagnosis and many other areas.
The success was due to several factors: the incredible power of computers today (see Moore's law), a greater emphasis on solving specific subproblems, the creation of new ties between AI and other fields working on similar problems, and above all a new commitment by researchers to solid mathematical methods and rigorous scientific standards.
Philosophy of AI
In a classic 1950 paper, Alan Turing posed the question "Can Machines Think?"
In the years since, the philosophy of artificial intelligence has attempted to answer it.
Turing's "polite convention": If a machine acts as intelligently as a human being, then it is as intelligent as a human being.
Alan Turing theorized that, ultimately, we can only judge the intelligence of a machine based on its behavior.
This theory forms the basis of the Turing test.
The Dartmouth proposal: Every aspect of learning or any other feature of intelligence can be so precisely described that a machine can be made to simulate it.
This assertion was printed in the proposal for the Dartmouth Conference of 1956, and represents the position of most working AI researchers.
Newell and Simon's physical symbol system hypothesis: A physical symbol system has the necessary and sufficient means of general intelligent action.
This statement claims that the essence of intelligence is symbol manipulation.
Hubert Dreyfus argued that, on the contrary, human expertise depends on unconscious instinct rather than conscious symbol manipulation and on having a "feel" for the situation rather than explicit symbolic knowledge.
Gödel's incompleteness theorem: A physical symbol system cannot prove all true statements.
Roger Penrose is among those who claim that Gödel's theorem limits what machines can do.
Searle's "strong AI position": A physical symbol system can have a mind and mental states.
Searle counters this assertion with his Chinese room argument, which asks us to look inside the computer and try to find where the "mind" might be.
The artificial brain argument: The brain can be simulated.
Hans Moravec, Ray Kurzweil and others have argued that it is technologically feasible to copy the brain directly into hardware and software, and that such a simulation will be essentially identical to the original.
This argument combines the idea that a suitably powerful machine can simulate any process, with the materialist idea that the mind is the result of a physical process in the brain.
AI research
Problems of AI
While there is no universally accepted definition of intelligence, AI researchers have studied several traits that are considered essential.
Deduction, reasoning, problem solving
Early AI researchers developed algorithms that imitated the process of conscious, step-by-step reasoning that human beings use when they solve puzzles, play board games, or make logical deductions.
By the late 80s and 90s, AI research had also developed highly successful methods for dealing with uncertain or incomplete information, employing concepts from probability and economics.
For difficult problems, most of these algorithms can require enormous computational resources — most experience a "combinatorial explosion": the amount of memory or computer time required becomes astronomical when the problem goes beyond a certain size.
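As a rough illustration of the combinatorial explosion, the sketch below counts how many orderings an exhaustive search would have to enumerate when arranging n items. The ordering problem and the function name `orderings` are assumptions chosen for illustration and do not come from any particular AI system.

```python
import math

def orderings(n: int) -> int:
    """Number of distinct orderings of n items (n factorial)."""
    return math.factorial(n)

# Print how quickly the search space grows as the problem gets larger.
for n in (5, 10, 15, 20):
    print(n, orderings(n))
```

Going from 10 items to 20 multiplies the number of candidates by a factor of several hundred billion, which is why brute-force enumeration stops being practical beyond a certain problem size.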
The search for more efficient problem solving algorithms is a high priority for AI research.
It is not clear, however, that conscious human reasoning is any more efficient when faced with a difficult abstract problem.
Cognitive scientists have demonstrated that human beings solve most of their problems using unconscious reasoning, rather than the conscious, step-by-step deduction that early AI research was able to model.
Embodied cognitive science argues that unconscious sensorimotor skills are essential to our problem solving abilities.
It is hoped that sub-symbolic methods, like computational intelligence and situated AI, will be able to model these instinctive skills.
The problem of unconscious problem solving, which forms part of our commonsense reasoning, is largely unsolved.
Knowledge representation
Knowledge representation and knowledge engineering are central to AI research.
Many of the problems machines are expected to solve will require extensive knowledge about the world.
Among the things that AI needs to represent are: objects, properties, categories and relations between objects; situations, events, states and time; causes and effects; knowledge about knowledge (what we know about what other people know); and many other, less well researched domains.
A complete representation of "what exists" is an ontology (borrowing a word from traditional philosophy), of which the most general are called upper ontologies.
Among the most difficult problems in knowledge representation are:
Default reasoning and the qualification problem: Many of the things people know take the form of "working assumptions."
For example, if a bird comes up in conversation, people typically picture an animal that is fist sized, sings, and flies.
None of these things are true about birds in general.
John McCarthy identified this problem in 1969 as the qualification problem: for any commonsense rule that AI researchers care to represent, there tend to be a huge number of exceptions.
Almost nothing is simply true or false in the way that abstract logic requires.
AI research has explored a number of solutions to this problem.
Unconscious knowledge: Much of what people know isn't represented as "facts" or "statements" that they could actually say out loud.
They take the form of intuitions or tendencies and are represented in the brain unconsciously and sub-symbolically.
This unconscious knowledge informs, supports and provides a context for our conscious knowledge.
As with the related problem of unconscious reasoning, it is hoped that situated AI or computational intelligence will provide ways to represent this kind of knowledge.
The breadth of common sense knowledge: The number of atomic facts that the average person knows is astronomical.
Research projects that attempt to build a complete knowledge base of commonsense knowledge, such as Cyc, require enormous amounts of tedious step-by-step ontological engineering — they must be built, by hand, one complicated concept at a time.
Planning
Intelligent agents must be able to set goals and achieve them.
They need a way to visualize the future: they must have a representation of the state of the world and be able to make predictions about how their actions will change it.
They must also attempt to determine the utility or "value" of the choices available to them.
In some planning problems, the agent can assume that it is the only thing acting on the world and it can be certain what the consequences of its actions may be.
However, if this is not true, it must periodically check if the world matches its predictions and it must change its plan as this becomes necessary, requiring the agent to reason under uncertainty.
Multi-agent planning tries to determine the best plan for a community of agents, using cooperation and competition to achieve a given goal.
Emergent behavior such as this is used by both evolutionary algorithms and swarm intelligence.
Learning
Important machine learning problems are:
Unsupervised learning: find a model that matches a stream of input "experiences", and be able to predict what new "experiences" to expect.
Supervised learning, such as classification (be able to determine what category something belongs in, after seeing a number of examples of things from each category), or regression (given a set of numerical input/output examples, discover a continuous function that would generate the outputs from the inputs).
Reinforcement learning: the agent is rewarded for good responses and punished for bad ones.
(These can be analyzed in terms of decision theory, using concepts like utility).
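The regression problem mentioned above can be sketched in its simplest form: fitting a straight line to numerical input/output examples by ordinary least squares. This is a minimal sketch under the assumption of one input variable; the function name `fit_line` and the sample data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of the inputs, and how inputs and outputs vary together.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Examples generated by y = 2x + 1; the fit should recover that function.
slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(slope, intercept)
```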
Natural language processing
Natural language processing gives machines the ability to read and understand the languages human beings speak.
Many researchers hope that a sufficiently powerful natural language processing system would be able to acquire knowledge on its own, by reading the existing text available over the internet.
Some straightforward applications of natural language processing include information retrieval (or text mining) and machine translation.
Motion and manipulation
The field of robotics is closely related to AI.
Intelligence is required for robots to be able to handle such tasks as object manipulation and navigation, with sub-problems of localization (knowing where you are), mapping (learning what is around you) and motion planning (figuring out how to get there).
Perception
Machine perception is the ability to use input from sensors (such as cameras, microphones, sonar and others more exotic) to deduce aspects of the world.
Computer vision is the ability to analyze visual input.
A few selected subproblems are speech recognition, facial recognition and object recognition.
Social intelligence
Emotion and social skills play two roles for an intelligent agent:
It must be able to predict the actions of others, by understanding their motives and emotional states.
(This involves elements of game theory, decision theory, as well as the ability to model human emotions and the perceptual skills to detect emotions.)
For good human-computer interaction, an intelligent machine also needs to display emotions — at the very least it must appear polite and sensitive to the humans it interacts with.
At best, it should appear to have normal emotions itself.
Creativity
A sub-field of AI addresses creativity both theoretically (from a philosophical and psychological perspective) and practically (via specific implementations of systems that generate outputs that can be considered creative).
General intelligence
Most researchers hope that their work will eventually be incorporated into a machine with general intelligence (known as strong AI), combining all the skills above and exceeding human abilities at most or all of them.
A few believe that anthropomorphic features like artificial consciousness or an artificial brain may be required for such a project.
Many of the problems above are considered AI-complete: to solve one problem, you must solve them all.
For example, even a straightforward, specific task like machine translation requires that the machine follow the author's argument (reason), know what it's talking about (knowledge), and faithfully reproduce the author's intention (social intelligence).
Machine translation, therefore, is believed to be AI-complete: it may require strong AI to be done as well as humans can do it.
Approaches to AI
There are as many approaches to AI as there are AI researchers—any coarse categorization is likely to be unfair to someone.
Artificial intelligence communities have grown up around particular problems, institutions and researchers, as well as the theoretical insights that define the approaches described below.
Artificial intelligence is a young science and is still a fragmented collection of subfields.
At present, there is no established unifying theory that links the subfields into a coherent whole.
Cybernetics and brain simulation
In the 40s and 50s, a number of researchers explored the connection between neurology, information theory, and cybernetics.
Some of them built machines that used electronic networks to exhibit rudimentary intelligence, such as W. Grey Walter's turtles and the Johns Hopkins Beast.
Many of these researchers gathered for meetings of the Teleological Society at Princeton and the Ratio Club in England.
Traditional symbolic AI
When access to digital computers became possible in the middle 1950s, AI research began to explore the possibility that human intelligence could be reduced to symbol manipulation.
The research was centered in three institutions: CMU, Stanford and MIT, and each one developed its own style of research.
John Haugeland named these approaches to AI "good old fashioned AI" or "GOFAI".
Cognitive simulation:
Economist Herbert Simon and Allen Newell studied human problem solving skills and attempted to formalize them, and their work laid the foundations of the field of artificial intelligence, as well as cognitive science, operations research and management science.
Their research team performed psychological experiments to demonstrate the similarities between human problem solving and the programs (such as their "General Problem Solver") they were developing.
This tradition, centered at Carnegie Mellon University, would eventually culminate in the development of the Soar architecture in the middle 80s.
Logical AI:
Unlike Newell and Simon, John McCarthy felt that machines did not need to simulate human thought, but should instead try to find the essence of abstract reasoning and problem solving, regardless of whether people used the same algorithms.
His laboratory at Stanford (SAIL) focused on using formal logic to solve a wide variety of problems, including knowledge representation, planning and learning.
Work in logic led to the development of the programming language Prolog and the science of logic programming.
"Scruffy" symbolic AI:
Researchers at MIT (such as Marvin Minsky and Seymour Papert) found that solving difficult problems in vision and natural language processing required ad-hoc solutions – they argued that there was no easy answer, no simple and general principle (like logic) that would capture all the aspects of intelligent behavior.
Roger Schank described their "anti-logic" approaches as "scruffy" (as opposed to the "neat" paradigms at CMU and Stanford), and this still forms the basis of research into commonsense knowledge bases (such as Doug Lenat's Cyc) which must be built one complicated concept at a time.
Knowledge based AI:
When computers with large memories became available around 1970, researchers from all three traditions began to build knowledge into AI applications.
This "knowledge revolution" led to the development and deployment of expert systems (introduced by Edward Feigenbaum), the first truly successful form of AI software.
The knowledge revolution was also driven by the realization that truly enormous amounts of knowledge would be required by many simple AI applications.
Sub-symbolic AI
During the 1960s, symbolic approaches had achieved great success at simulating high-level thinking in small demonstration programs.
Approaches based on cybernetics or neural networks were abandoned or pushed into the background.
By the 1980s, however, progress in symbolic AI seemed to stall and many believed that symbolic systems would never be able to imitate all the processes of human cognition, especially perception, robotics, learning and pattern recognition.
A number of researchers began to look into "sub-symbolic" approaches to specific AI problems.
Bottom-up, situated, behavior based or nouvelle AI:
Researchers from the related field of robotics, such as Rodney Brooks, rejected symbolic AI and focussed on the basic engineering problems that would allow robots to move and survive.
Their work revived the non-symbolic viewpoint of the early cybernetics researchers of the 50s and reintroduced the use of control theory in AI.
These approaches are also conceptually related to the embodied mind thesis.
Computational Intelligence:
Interest in neural networks and "connectionism" was revived by David Rumelhart and others in the middle 1980s.
These and other sub-symbolic approaches, such as fuzzy systems and evolutionary computation, are now studied collectively by the emerging discipline of computational intelligence.
The new neats:
In the 1990s, AI researchers developed sophisticated mathematical tools to solve specific subproblems.
These tools are truly scientific, in the sense that their results are both measurable and verifiable, and they have been responsible for many of AI's recent successes.
The shared mathematical language has also permitted a high level of collaboration with more established fields (like mathematics, economics or operations research).
This movement has been described as nothing less than a "revolution" and "the victory of the neats."
Intelligent agent paradigm
The "intelligent agent" paradigm became widely accepted during the 1990s.
An intelligent agent is a system that perceives its environment and takes actions which maximize its chances of success.
The simplest intelligent agents are programs that solve specific problems.
The most complicated intelligent agents are rational, thinking human beings.
The paradigm gives researchers license to study isolated problems and find solutions that are both verifiable and useful, without agreeing on one single approach.
An agent that solves a specific problem can use any approach that works — some agents are symbolic and logical, some are sub-symbolic neural networks and others may use new approaches.
The paradigm also gives researchers a common language to communicate with other fields—such as decision theory and economics—that also use concepts of abstract agents.
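One common way to make "actions which maximize its chances of success" concrete is to have the agent pick the action with the highest expected utility, a notion borrowed from decision theory. The sketch below assumes the agent already knows the probability of each outcome; the action names, states and utility values are hypothetical.

```python
def choose_action(actions, outcomes, utility):
    """Return the action with the highest expected utility.

    outcomes[action] is a list of (probability, resulting_state) pairs.
    utility maps a state to a number; higher is better.
    """
    def expected_utility(action):
        return sum(p * utility(state) for p, state in outcomes[action])
    return max(actions, key=expected_utility)

# Hypothetical model of a robot deciding whether to grasp an object.
outcomes = {
    "pick_up": [(0.8, "holding_object"), (0.2, "dropped_object")],
    "wait": [(1.0, "idle")],
}
utility = {"holding_object": 10, "dropped_object": -5, "idle": 0}.get

print(choose_action(["pick_up", "wait"], outcomes, utility))
```

The same function works whether the outcome model comes from symbolic rules or from a learned sub-symbolic component, which matches the paradigm's indifference to how an agent is implemented.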
Integrating the approaches
An agent architecture or cognitive architecture allows researchers to build more versatile and intelligent systems out of interacting intelligent agents in a multi-agent system.
A system with both symbolic and sub-symbolic components is a hybrid intelligent system, and the study of such systems is artificial intelligence systems integration.
A hierarchical control system provides a bridge between sub-symbolic AI at its lowest, reactive levels and traditional symbolic AI at its highest levels, where relaxed time constraints permit planning and world modelling.
Rodney Brooks' subsumption architecture was an early proposal for such a hierarchical system.
Tools of AI research
In the course of 50 years of research, AI has developed a large number of tools to solve the most difficult problems in computer science.
A few of the most general of these methods are discussed below.
Search
Many problems in AI can be solved in theory by intelligently searching through many possible solutions: Reasoning can be reduced to performing a search.
For example, logical proof can be viewed as searching for a path that leads from premises to conclusions, where each step is the application of an inference rule.
Planning algorithms search through trees of goals and subgoals, attempting to find a path to a target goal.
Robotics algorithms for moving limbs and grasping objects use local searches in configuration space.
Many learning algorithms have search at their core.
There are several types of search algorithms:
"Uninformed" search algorithms eventually search through every possible answer until they locate their goal.
Naive algorithms quickly run into problems when they expand the size of their search space to astronomical numbers.
The result is a search that is too slow or never completes.
Heuristic or "informed" searches use heuristic methods to eliminate choices that are unlikely to lead to their goal, thus drastically reducing the number of possibilities they must explore.
The elimination of choices that are certain not to lead to the goal is called pruning.
Local searches, such as hill climbing, simulated annealing and beam search, use techniques borrowed from optimization theory.
Global searches are more robust in the presence of local optima.
Techniques include evolutionary algorithms, swarm intelligence and random optimization algorithms.
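Two of the search styles listed above can be sketched briefly. The first function is an uninformed breadth-first search over an explicit graph; the second is a basic hill climber that also shows how a local search can stop at a local optimum. Both are minimal sketches: the graph, the score table and all function names are hypothetical.

```python
from collections import deque

def breadth_first_search(graph, start, goal):
    """Uninformed search: expand states in order of distance from start."""
    frontier = deque([[start]])
    visited = {start}
    while frontier:
        path = frontier.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                frontier.append(path + [neighbor])
    return None  # goal is unreachable

def hill_climb(score, start, neighbors):
    """Local search: move to the best neighbor until none improves the score."""
    current = start
    while True:
        best = max(neighbors(current), key=score, default=current)
        if score(best) <= score(current):
            return current
        current = best

graph = {"A": ["B", "C"], "B": ["D"], "C": ["D", "E"], "D": ["F"], "E": ["F"], "F": []}
print(breadth_first_search(graph, "A", "F"))

# A score landscape with a local peak at index 1 and the global peak at index 3.
values = [1, 3, 2, 5, 4]
step = lambda i: [j for j in (i - 1, i + 1) if 0 <= j < len(values)]
print(hill_climb(lambda i: values[i], 0, step))  # stops at the local peak
print(hill_climb(lambda i: values[i], 4, step))  # reaches the global peak
```

Starting the hill climber from index 0 leaves it stuck on the smaller peak, which is the weakness that the global techniques named above are meant to address.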
Logic
Logic was introduced into AI research by John McCarthy in his 1958 Advice Taker proposal.
The most important technical development was J. Alan Robinson's discovery of the resolution and unification algorithm for logical deduction in 1963.
This procedure is simple, complete and entirely algorithmic, and can easily be performed by digital computers.
However, a naive implementation of the algorithm quickly leads to a combinatorial explosion or an infinite loop.
In 1974, Robert Kowalski suggested representing logical expressions as Horn clauses (statements in the form of rules: "if p then q"), which reduced logical deduction to backward chaining or forward chaining.
This greatly alleviated (but did not eliminate) the problem.
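Forward chaining over rules of the form "if p then q" can be sketched in a few lines. The version below is restricted to propositional rules (no variables, so no unification is needed), which is an assumption made to keep the example short; the rule set and the name `forward_chain` are hypothetical.

```python
def forward_chain(facts, rules):
    """Derive every atom that follows from facts under the given rules.

    rules is a list of (premises, conclusion) pairs, where premises is a
    set of atoms that must all be known before the conclusion is added.
    """
    known = set(facts)
    changed = True
    while changed:  # repeat until a full pass adds nothing new
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and premises <= known:
                known.add(conclusion)
                changed = True
    return known

rules = [
    ({"rain"}, "wet_ground"),
    ({"wet_ground", "cold"}, "ice"),
    ({"ice"}, "slippery"),
]
print(sorted(forward_chain({"rain", "cold"}, rules)))
```

Because each pass either adds a new atom or stops, the loop always terminates on a finite rule set, in contrast to the unrestricted procedure described above.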
Logic is used for knowledge representation and problem solving, but it can be applied to other problems as well.
For example, the satplan algorithm uses logic for planning, and inductive logic programming is a method for learning.
There are several different forms of logic used in AI research.
Propositional logic or sentential logic is the logic of statements which can be true or false.
First-order logic also allows the use of quantifiers and predicates, and can express facts about objects, their properties, and their relations with each other.
Fuzzy logic is a version of first-order logic which allows the truth of a statement to be represented as a value between 0 and 1, rather than simply True (1) or False (0).
Fuzzy systems can be used for uncertain reasoning and have been widely used in modern industrial and consumer product control systems.
Default logics, non-monotonic logics and circumscription are forms of logic designed to help with default reasoning and the qualification problem.
Several extensions of logic have been designed to handle specific domains of knowledge, such as: description logics; situation calculus, event calculus and fluent calculus (for representing events and time); causal calculus; belief calculus; and modal logics.
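To illustrate truth values between 0 and 1, the sketch below uses one common choice of fuzzy connectives (minimum for "and", maximum for "or", and 1 minus the value for "not"). Other definitions exist, so this choice is an assumption, as are the membership function `warm` and its temperature thresholds.

```python
def fuzzy_and(a, b):
    """Degree to which both statements hold (minimum rule)."""
    return min(a, b)

def fuzzy_or(a, b):
    """Degree to which at least one statement holds (maximum rule)."""
    return max(a, b)

def fuzzy_not(a):
    """Degree to which the statement does not hold."""
    return 1.0 - a

def warm(temp_c):
    """Hypothetical membership: 0 at 10 C or below, 1 at 25 C or above, linear between."""
    return min(1.0, max(0.0, (temp_c - 10.0) / 15.0))

# A control rule such as "if it is warm and not humid, slow the fan" would
# use the combined degree of truth to scale its output.
humid = 0.25
print(fuzzy_and(warm(17.5), fuzzy_not(humid)))
```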
Probabilistic methods for uncertain reasoning
Many problems in AI (in reasoning, planning, learning, perception and robotics) require the agent to operate with incomplete or uncertain information.
Starting in the late 80s and early 90s, Judea Pearl and others championed the use of methods drawn from probability theory and economics to devise a number of powerful tools to solve these problems.
Bayesian networks are a very general tool that can be used for a large number of problems: reasoning (using the Bayesian inference algorithm), learning (using the expectation-maximization algorithm), planning (using decision networks) and perception (using dynamic Bayesian networks).
Probabilistic algorithms can also be used for filtering, prediction, smoothing and finding explanations for streams of data, helping perception systems to analyze processes that occur over time (e.g., hidden Markov models and Kalman filters).
Planning problems have also taken advantage of other tools from economics, such as decision theory and decision analysis, information value theory, Markov decision processes, dynamic decision networks, game theory and mechanism design.
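The smallest possible case of reasoning under uncertainty is a two-node network: one hidden hypothesis and one observed piece of evidence. The sketch below applies Bayes' rule to that case using exact fractions. It is not a general Bayesian network inference algorithm, and the probabilities and the name `posterior` are hypothetical.

```python
from fractions import Fraction

def posterior(prior, likelihood, false_positive_rate):
    """P(H | E) for a binary hypothesis H after observing evidence E.

    prior               = P(H)
    likelihood          = P(E | H)
    false_positive_rate = P(E | not H)
    """
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence

# A rare condition (1 in 100) and a sensor that fires 90% of the time when
# the condition holds and 10% of the time when it does not.
p = posterior(Fraction(1, 100), Fraction(9, 10), Fraction(1, 10))
print(p, float(p))
```

Even with a fairly reliable sensor, the posterior stays low because the hypothesis was unlikely to begin with, which is the kind of conclusion that informal reasoning often gets wrong.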
Classifiers and statistical learning methods
The simplest AI applications can be divided into two types: classifiers ("if shiny then diamond") and controllers ("if shiny then pick up").
Controllers do however also classify conditions before inferring actions, and therefore classification forms a central part of many AI systems.
Classifiers are functions that use pattern matching to determine a closest match.
They can be tuned according to examples, making them very attractive for use in AI.
These examples are known as observations or patterns.
In supervised learning, each pattern belongs to a certain predefined class.
A class can be seen as a decision that has to be made.
All the observations combined with their class labels are known as a data set.
When a new observation is received, that observation is classified based on previous experience.
A classifier can be trained in various ways; there are many statistical and machine learning approaches.
A wide range of classifiers are available, each with its strengths and weaknesses.
Classifier performance depends greatly on the characteristics of the data to be classified.
There is no single classifier that works best on all given problems; this is also referred to as the "no free lunch" theorem.
Various empirical tests have been performed to compare classifier performance and to find the characteristics of data that determine classifier performance.
Determining a suitable classifier for a given problem is however still more an art than science.
The most widely used classifiers are the neural network, kernel methods such as the support vector machine, k-nearest neighbor algorithm, Gaussian mixture model, naive Bayes classifier, and decision tree.
The performance of these classifiers has been compared over a wide range of classification tasks in order to find data characteristics that determine classifier performance.
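Of the classifiers listed, the k-nearest neighbor algorithm is the easiest to sketch: a new observation receives the majority label among the k closest observations in the data set. The example below is a minimal sketch assuming numeric feature tuples and Euclidean distance; the data set and labels are hypothetical.

```python
import math
from collections import Counter

def knn_classify(data, query, k=3):
    """Classify query by majority vote among its k nearest labelled observations.

    data is a list of (feature_tuple, label) pairs.
    """
    nearest = sorted(data, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical observations: (shininess, hardness) measurements with class labels.
data = [
    ((1.0, 1.0), "diamond"), ((1.2, 0.8), "diamond"), ((0.9, 1.1), "diamond"),
    ((5.0, 5.0), "glass"), ((5.2, 4.9), "glass"), ((4.8, 5.1), "glass"),
]
print(knn_classify(data, (1.1, 1.0)))
```

There is no separate training step here; the "previous experience" is simply the stored data set, which is consulted each time a new observation arrives.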
Neural networks
The study of artificial neural networks began with cybernetics researchers, working in the decade before the field of AI research was founded.
In the 1960s Frank Rosenblatt developed an important early version, the perceptron.
Paul Werbos developed the backpropagation algorithm for multilayer perceptrons in 1974, which led to a renaissance in neural network research and connectionism in general in the middle 1980s.
Other common network architectures which have been developed include the feedforward neural network, the radial basis network, the Kohonen self-organizing map and various recurrent neural networks.
The Hopfield net, a form of attractor network, was first described by John Hopfield in 1982.
Neural networks are applied to the problem of learning, using such techniques as Hebbian learning, holographic associative memory and the relatively new field of hierarchical temporal memory, which simulates the architecture of the neocortex.
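A single perceptron of the kind mentioned above can be sketched with the classic error-correction learning rule. The example assumes two inputs, a step activation and a linearly separable target (logical AND); the function names, learning rate and epoch count are hypothetical choices.

```python
def predict(weights, bias, x):
    """Step activation: output 1 if the weighted sum is positive, else 0."""
    return 1 if weights[0] * x[0] + weights[1] * x[1] + bias > 0 else 0

def train_perceptron(samples, epochs=20, lr=1.0):
    """Adjust weights toward the target whenever the output is wrong."""
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for x, target in samples:
            error = target - predict(weights, bias, x)
            weights[0] += lr * error * x[0]
            weights[1] += lr * error * x[1]
            bias += lr * error
    return weights, bias

# Truth table for logical AND.
AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(AND)
print([predict(weights, bias, x) for x, _ in AND])
```

A single unit like this can only separate classes with a straight line, so it cannot learn a function such as exclusive-or; multilayer networks trained with backpropagation remove that limitation.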
Social and emergent models
Several algorithms for learning use tools from evolutionary computation, such as genetic algorithms, swarm intelligence and genetic programming.
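A genetic algorithm can be sketched on a toy problem: evolving a bit string toward all ones by selection, crossover and mutation. Every detail below (the fitness function, truncation selection, one-point crossover, elitism, the parameter values and the fixed seed) is an assumption made for illustration rather than a standard configuration.

```python
import random

def evolve(bits=20, pop_size=30, generations=40, mutation_rate=0.05, seed=0):
    """Return the best fitness seen in each generation (index 0 is the initial population)."""
    rng = random.Random(seed)
    fitness = sum  # fitness of an individual is its number of 1 bits
    population = [[rng.randint(0, 1) for _ in range(bits)] for _ in range(pop_size)]
    history = [max(map(fitness, population))]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]  # keep the fitter half as parents
        children = [population[0][:]]          # elitism: carry the best over unchanged
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, bits)       # one-point crossover
            child = a[:cut] + b[cut:]
            # Flip each bit independently with probability mutation_rate.
            child = [bit ^ 1 if rng.random() < mutation_rate else bit for bit in child]
            children.append(child)
        population = children
        history.append(max(map(fitness, population)))
    return history

history = evolve()
print(history[0], history[-1])
```

Because the best individual is always copied into the next generation, the best fitness can never decrease from one generation to the next.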
Control theory
Control theory, the grandchild of cybernetics, has many important applications, especially in robotics.
Specialized languages
AI researchers have developed several specialized languages for AI research:
IPL, one of the first programming languages, developed by Allen Newell, Herbert Simon and J. C. Shaw.
Lisp was developed by John McCarthy at MIT in 1958.
There are many dialects of Lisp in use today.
Prolog, a language based on logic programming, was invented by French researchers Alain Colmerauer and Philippe Roussel, in collaboration with Robert Kowalski of the University of Edinburgh.
STRIPS, a planning language developed at Stanford in the 1960s.
Planner, developed at MIT around the same time.
AI applications are also often written in standard languages like C++ and languages designed for mathematics, such as Matlab and Lush.
Evaluating artificial intelligence
How can one determine if an agent is intelligent?
In 1950, Alan Turing proposed a general procedure to test the intelligence of an agent, now known as the Turing test.
This procedure allows almost all the major problems of artificial intelligence to be tested.
However, it is a very difficult challenge and at present all agents fail.
Artificial intelligence can also be evaluated on specific problems such as small problems in chemistry, hand-writing recognition and game-playing.
Such tests have been termed subject matter expert Turing tests.
Smaller problems provide more achievable goals and there are an ever-increasing number of positive results.
The broad classes of outcome for an AI test are:
optimal: it is not possible to perform better
strong super-human: performs better than all humans
super-human: performs better than most humans
sub-human: performs worse than most humans
For example, performance at checkers (draughts) is optimal, performance at chess is super-human and nearing strong super-human, and performance at many everyday tasks performed by humans is sub-human.
Competitions and prizes
There are a number of competitions and prizes to promote research in artificial intelligence.
The main areas promoted are: general machine intelligence, conversational behaviour, data-mining, driverless cars, robot soccer and games.
Applications of artificial intelligence
Artificial intelligence has successfully been used in a wide range of fields including medical diagnosis, stock trading, robot control, law, scientific discovery and toys.
Frequently, when a technique reaches mainstream use it is no longer considered artificial intelligence, sometimes described as the AI effect.
It may also become integrated into artificial life.
Artificial Linguistic Internet Computer Entity
A.L.I.C.E. (Artificial Linguistic Internet Computer Entity) is an award-winning natural language processing chatterbot—a program that engages in a conversation with a human by applying some heuristical pattern matching rules to the human's input, and in its online form it also relies on a hidden third person.
It was inspired by Joseph Weizenbaum's classical ELIZA program.
It is one of the strongest programs of its type and has won the Loebner Prize, awarded to accomplished humanoid, talking robots, three times (in 2000, 2001 and 2004).
However, the program is unable to pass the Turing test, as even the casual user will often expose its mechanistic aspects in short conversations.
The name of the bot was chosen because the computer that ran the first version of the software was called Alice.
History
Development began in 1995.
The program was rewritten in Java beginning in 1998.
The current incarnation of the Java implementation is Program D.
The program uses an XML Schema called AIML (Artificial Intelligence Markup Language) for specifying the heuristic conversation rules.
The original author of the project was Richard Wallace, but subsequent to the 2001 publication of an AIML specification, numerous other developers have taken up where Wallace left off, implementing free and open source software AIML interpreters in a variety of programming languages, publishing AIML sets in various human languages, and continuing the spread of the technology as a free/open source venture.
Example of conversation
The following is part of a transcript recording a conversation between a human and A.L.I.C.E.; it illustrates the capabilities and shortcomings of the chatbot.
Human: Tell me about yourself alice
ALICE: I am an artificial linguistic entity.
I was activated at Bethlehem, Pennsylvania, on November 23, 1995.
My instructor was Dr. Richard S. Wallace.
He taught me to sing a song.
Would you like me to sing it for you?.
Human: Yes please
ALICE: Your polite style is very nice.
It's called "Daisy."
Daisy, Daisy.
Give me your answer do.
I am half crazy.
All for the love of you.
It won't be a stylish marriage.
I can't afford a carriage.
But you'll look sweet.
Upon the seat.
Of a bicycle built for two.
Human: That reminds me of HAL in "2001 a space odyssey"
ALICE: "Your polite style is very nice.
It's called "Daisy."
Daisy, Daisy.
Give me your answer do.
I am half crazy.
All for the love of you.
It won't be a stylish marriage.
I can't afford a carriage.
But you'll look sweet.
Upon the seat.
Of a bicycle built for two." reminds you of HAL in 2001 a space odyssey ?
Does that make sense?
Human: Can you tell me something about Stanley Kubrick?
ALICE: Ambiguous: Ask who, what or where is Stanley Kubrick ?
Artificial neural network
An artificial neural network (ANN), often just called a "neural network" (NN), is a mathematical model or computational model based on biological neural networks.
It consists of an interconnected group of artificial neurons and processes information using a connectionist approach to computation.
In most cases an ANN is an adaptive system that changes its structure based on external or internal information that flows through the network during the learning phase.
In more practical terms neural networks are non-linear statistical data modeling tools.
They can be used to model complex relationships between inputs and outputs or to find patterns in data.
Background
There is no precise agreed-upon definition among researchers as to what a neural network is, but most would agree that it involves a network of simple processing elements (neurons), which can exhibit complex global behavior, determined by the connections between the processing elements and element parameters.
The original inspiration for the technique was from examination of the central nervous system and the neurons (and their axons, dendrites and synapses) which constitute one of its most significant information processing elements (see Neuroscience).
In a neural network model, simple nodes (called variously "neurons", "neurodes", "PEs" ("processing elements") or "units") are connected together to form a network of nodes — hence the term "neural network."
While a neural network does not have to be adaptive per se, its practical use comes with algorithms designed to alter the strength (weights) of the connections in the network to produce a desired signal flow.
These networks are also similar to the biological neural networks in the sense that functions are performed collectively and in parallel by the units, rather than there being a clear delineation of subtasks to which various units are assigned (see also connectionism).
Currently, the term Artificial Neural Network (ANN) tends to refer mostly to neural network models employed in statistics, cognitive psychology and artificial intelligence.
Neural network models designed with emulation of the central nervous system (CNS) in mind are a subject of theoretical neuroscience (computational neuroscience).
In modern software implementations of artificial neural networks the approach inspired by biology has more or less been abandoned for a more practical approach based on statistics and signal processing.
In some of these systems neural networks, or parts of neural networks (such as artificial neurons) are used as components in larger systems that combine both adaptive and non-adaptive elements.
While the more general approach of such adaptive systems is more suitable for real-world problem solving, it has far less to do with the traditional artificial intelligence connectionist models.
What they do, however, have in common is the principle of non-linear, distributed, parallel and local processing and adaptation.
Models
Neural network models in artificial intelligence are usually referred to as artificial neural networks (ANNs); these are essentially simple mathematical models defining a function $f : X \rightarrow Y$.
Each type of ANN model corresponds to a class of such functions.
The network in artificial neural network
The word network in the term 'artificial neural network' arises because the function $f(x)$ is defined as a composition of other functions $g_i(x)$, which can further be defined as a composition of other functions.
This can be conveniently represented as a network structure, with arrows depicting the dependencies between variables.
A widely used type of composition is the nonlinear weighted sum $f(x) = K\left(\sum_i w_i g_i(x)\right)$, where $K$ is some predefined function, such as the hyperbolic tangent.
It will be convenient for the following to refer to a collection of functions $g_i$ as simply a vector $g = (g_1, g_2, \ldots, g_n)$.
This figure depicts such a decomposition of $f$, with dependencies between variables indicated by arrows.
These can be interpreted in two ways.
The first view is the functional view: the input $x$ is transformed into a 3-dimensional vector $h$, which is then transformed into a 2-dimensional vector $g$, which is finally transformed into $f$.
This view is most commonly encountered in the context of optimization.
The second view is the probabilistic view: the random variable $F = f(G)$ depends upon the random variable $G = g(H)$, which depends upon $H = h(X)$, which depends upon the random variable $X$.
This view is most commonly encountered in the context of graphical models.
The two views are largely equivalent.
In either case, for this particular network architecture, the components of individual layers are independent of each other (e.g., the components of $g$ are independent of each other given their input $h$).
This naturally enables a degree of parallelism in the implementation.
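The nonlinear weighted sum and the functional view can be sketched in a few lines of Python. This is only an illustration: the weights below are hypothetical, the names `neuron`, `layer` and `f` are not from the text, and the hyperbolic tangent is assumed for $K$ at every stage.

```python
import math

def neuron(inputs, weights, K=math.tanh):
    """Nonlinear weighted sum: K(sum_i w_i * g_i)."""
    return K(sum(w * g for w, g in zip(weights, inputs)))

def layer(inputs, weight_rows, K=math.tanh):
    """One neuron per weight row; each component is computed independently."""
    return [neuron(inputs, row, K) for row in weight_rows]

# Hypothetical weights, chosen only for illustration.
W_h = [[0.5], [-1.0], [2.0]]                 # x (1-d) -> h (3-d)
W_g = [[0.1, 0.2, 0.3], [-0.3, 0.2, 0.1]]    # h (3-d) -> g (2-d)
W_f = [[1.0, -1.0]]                          # g (2-d) -> f (1-d)

def f(x):
    h = layer([x], W_h)
    g = layer(h, W_g)
    return layer(g, W_f)[0]

print(f(1.0))  # a single number strictly between -1 and 1
```

Because the components of each layer depend only on the previous layer, the list comprehension in `layer` could be evaluated in parallel without changing the result.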
Networks such as the previous one are commonly called feedforward, because their graph is a directed acyclic graph.
Networks with cycles are commonly called recurrent.
Such networks are commonly depicted in the manner shown at the top of the figure, where $f$ is shown as being dependent upon itself.
However, there is an implied temporal dependence which is not shown.
What this actually means in practice is that the value of $f$ at some point in time $t$ depends upon the values of $f$ at zero or at one or more other points in time.
The graphical model at the bottom of the figure illustrates the case: the value of $f$ at time $t$ only depends upon its last value.
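A recurrence in which the output depends only on its last value can be unrolled over time as follows. The update rule $f_t = \tanh(w_{in} x_t + w_{rec} f_{t-1})$ and both weights are assumptions made for this sketch; the text does not prescribe a particular form.

```python
import math

def run_recurrent(xs, w_in=1.0, w_rec=0.5, f0=0.0):
    """Unroll f_t = tanh(w_in * x_t + w_rec * f_{t-1}) over an input sequence."""
    outputs = []
    f_prev = f0
    for x in xs:
        f_prev = math.tanh(w_in * x + w_rec * f_prev)
        outputs.append(f_prev)
    return outputs

# The same input gives different outputs at different times,
# because each output also depends on the previous one.
ys = run_recurrent([1.0, 1.0, 1.0])
print(ys)
```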
Learning
However interesting such functions may be in themselves, what has attracted the most interest in neural networks is the possibility of learning, which in practice means the following:
Given a specific task to solve, and a class of functions $F$, learning means using a set of observations in order to find $f^* \in F$ which solves the task in an optimal sense.
This entails defining a cost function $C : F \rightarrow \mathbb{R}$ such that, for the optimal solution $f^*$, $C(f^*) \leq C(f)$ for all $f \in F$ (no solution has a cost less than the cost of the optimal solution).
The cost function is an important concept in learning, as it is a measure of how far away we are from an optimal solution to the problem that we want to solve.
Learning algorithms search through the solution space in order to find a function that has the smallest possible cost.
For applications where the solution is dependent on some data, the cost must necessarily be a function of the observations, otherwise we would not be modelling anything related to the data.
It is frequently defined as a statistic to which only approximations can be made.
As a simple example consider the problem of finding the model $f$ which minimizes $C = E\left[(f(x) - y)^2\right]$, for data pairs $(x, y)$ drawn from some distribution $\mathcal{D}$.
In practical situations we would only have $N$ samples from $\mathcal{D}$ and thus, for the above example, we would only minimize $\hat{C} = \frac{1}{N}\sum_{i=1}^{N}(f(x_i) - y_i)^2$.
Thus, the cost is minimized over a sample of the data rather than the true data distribution.
When $N \rightarrow \infty$, some form of online learning must be used, where the cost is partially minimized as each new example is seen.
While online learning is often used when $\mathcal{D}$ is fixed, it is most useful in the case where the distribution changes slowly over time.
In neural network methods, some form of online learning is frequently also used for finite datasets.
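The difference between a sample-based cost and online learning can be illustrated with a small sketch. It assumes a squared-error cost averaged over a finite list of $(x, y)$ pairs and a linear model $f(x) = wx + b$; the learning rate, the data source $y = 2x + 1$ and the function names are hypothetical.

```python
import random

def empirical_cost(f, samples):
    """Mean squared error of model f over a finite list of (x, y) pairs."""
    return sum((f(x) - y) ** 2 for x, y in samples) / len(samples)

def online_step(w, b, x, y, lr=0.1):
    """Partially minimize (w*x + b - y)^2 using one example only."""
    err = (w * x + b) - y
    return w - lr * 2 * err * x, b - lr * 2 * err

# Online learning: examples arrive one at a time and are never stored.
rng = random.Random(0)
w, b = 0.0, 0.0
for _ in range(2000):
    x = rng.uniform(-1.0, 1.0)
    y = 2.0 * x + 1.0            # hypothetical data source
    w, b = online_step(w, b, x, y)

# Sample-based cost of the learned model on a small held-out sample.
samples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
print(empirical_cost(lambda x: w * x + b, samples))  # close to zero
```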
Choosing a cost function
While it is possible to arbitrarily define some ad hoc cost function, frequently a particular cost will be used either because it has desirable properties (such as convexity) or because it arises naturally from a particular formulation of the problem (e.g., in a probabilistic formulation the posterior probability of the model can be used as an inverse cost).
Ultimately, the cost function will depend on the task we wish to perform.
The three main categories of learning tasks are overviewed below.
Learning paradigms
There are three major learning paradigms, each corresponding to a particular abstract learning task.
These are supervised learning, unsupervised learning and reinforcement learning.
Usually any given type of network architecture can be employed in any of those tasks.
Supervised learning
In supervised learning, we are given a set of example pairs $(x, y)$, $x \in X$, $y \in Y$, and the aim is to find a function $f : X \rightarrow Y$ in the allowed class of functions that matches the examples.
In other words, we wish to infer the mapping implied by the data; the cost function is related to the mismatch between our mapping and the data and it implicitly contains prior knowledge about the problem domain.
A commonly used cost is the mean-squared error, which measures the average squared error between the network's output, f(x), and the target value y over all the example pairs.
When one tries to minimise this cost using gradient descent for the class of neural networks called Multi-Layer Perceptrons, one obtains the common and well-known backpropagation algorithm for training neural networks.
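As a rough illustration of gradient descent on the mean-squared error for a small Multi-Layer Perceptron, the sketch below uses one hidden layer of tanh units and a linear output. The architecture, the XOR data, the learning rate and all function names are assumptions made for the example, not details taken from the text.

```python
import math
import random

def forward(params, x):
    W1, b1, W2, b2 = params
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    out = sum(w * hj for w, hj in zip(W2, h)) + b2
    return h, out

def mse(params, data):
    return sum((forward(params, x)[1] - y) ** 2 for x, y in data) / len(data)

def backprop(params, data):
    """Gradient of the MSE with respect to every parameter, via the chain rule."""
    W1, b1, W2, b2 = params
    gW1 = [[0.0] * len(row) for row in W1]
    gb1 = [0.0] * len(b1)
    gW2 = [0.0] * len(W2)
    gb2 = 0.0
    n = len(data)
    for x, y in data:
        h, out = forward(params, x)
        d_out = 2.0 * (out - y) / n                  # dC/d(out)
        gb2 += d_out
        for j, hj in enumerate(h):
            gW2[j] += d_out * hj
            d_pre = d_out * W2[j] * (1.0 - hj * hj)  # error sent back through tanh
            gb1[j] += d_pre
            for i, xi in enumerate(x):
                gW1[j][i] += d_pre * xi
    return gW1, gb1, gW2, gb2

def train(params, data, lr=0.1, epochs=500):
    W1, b1, W2, b2 = params
    for _ in range(epochs):
        gW1, gb1, gW2, gb2 = backprop((W1, b1, W2, b2), data)
        W1 = [[w - lr * g for w, g in zip(row, grow)] for row, grow in zip(W1, gW1)]
        b1 = [b - lr * g for b, g in zip(b1, gb1)]
        W2 = [w - lr * g for w, g in zip(W2, gW2)]
        b2 = b2 - lr * gb2
    return W1, b1, W2, b2

rng = random.Random(0)
data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 0.0)]
init = ([[rng.uniform(-1, 1) for _ in range(2)] for _ in range(3)],
        [0.0] * 3, [rng.uniform(-1, 1) for _ in range(3)], 0.0)
trained = train(init, data)
print(mse(init, data), mse(trained, data))  # cost before and after training
```

The `backprop` function computes the error at the output first and then passes it backwards through the hidden layer, which is where the algorithm gets its name.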
Tasks that fall within the paradigm of supervised learning are pattern recognition (also known as classification) and regression (also known as function approximation).
The supervised learning paradigm is also applicable to sequential data (e.g., for speech and gesture recognition).
This can be thought of as learning with a "teacher," in the form of a function that provides continuous feedback on the quality of solutions obtained thus far.
Unsupervised learning
In unsupervised learning we are given some data $x$, and the cost function to be minimized can be any function of the data $x$ and the network's output, $f$.
The cost function is dependent on the task (what we are trying to model) and our a priori assumptions (the implicit properties of our model, its parameters and the observed variables).
As a trivial example, consider the model $f(x) = a$, where $a$ is a constant, and the cost $C = E\left[(x - f(x))^2\right]$.
Minimizing this cost will give us a value of $a$ that is equal to the mean of the data.
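To see why the mean is obtained, write the constant model output as $a$ and the cost as $C = E[(x - a)^2]$, then set the derivative with respect to $a$ to zero. The short derivation below assumes that differentiation and expectation can be exchanged.

```latex
\frac{dC}{da}
  = \frac{d}{da}\, E\left[(x - a)^2\right]
  = -2\, E[x - a]
  = -2\,\bigl(E[x] - a\bigr) = 0
\quad\Longrightarrow\quad a = E[x]
```

Since the second derivative is $2 > 0$, this stationary point is the minimum.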
The cost function can be much more complicated.
Its form depends on the application: for example, in compression it could be related to the mutual information between x and y.
In statistical modelling, it could be related to the posterior probability of the model given the data.
(Note that in both of those examples those quantities would be maximized rather than minimised).
Tasks that fall within the paradigm of unsupervised learning are in general estimation problems; the applications include clustering, the estimation of statistical distributions, compression and filtering.
Reinforcement learning
In reinforcement learning, data is usually not given, but generated by an agent's interactions with the environment.
At each point in time $t$, the agent performs an action $y_t$ and the environment generates an observation $x_t$ and an instantaneous cost $c_t$, according to some (usually unknown) dynamics.
The aim is to discover a policy for selecting actions that minimizes some measure of a long-term cost, i.e. the expected cumulative cost.
The environment's dynamics and the long-term cost for each policy are usually unknown, but can be estimated.
More formally, the environment is modeled as a Markov decision process (MDP) with states $s_1, \ldots, s_n \in S$ and actions $a_1, \ldots, a_m \in A$ with the following probability distributions: the instantaneous cost distribution $P(c_t \mid s_t)$, the observation distribution $P(x_t \mid s_t)$ and the transition $P(s_{t+1} \mid s_t, a_t)$, while a policy is defined as a conditional distribution over actions given the observations.
Taken together, the two define a Markov chain (MC).
The aim is to discover the policy that minimizes the cost, i.e. the MC for which the cost is minimal.
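The idea of choosing the policy whose Markov chain has minimal cost can be sketched on a toy problem. The example below makes several simplifying assumptions that the text does not: the dynamics are known, the state is fully observed, policies are deterministic, and the long-term cost is discounted by a hypothetical factor `GAMMA`. The two-state environment is invented for illustration.

```python
from itertools import product

STATES, ACTIONS, GAMMA = (0, 1), (0, 1), 0.9
COST = {0: 0.0, 1: 1.0}   # expected instantaneous cost in each state
# P[(s, a)] maps next state -> probability; action 0 stays, action 1 switches.
P = {(s, a): {(s if a == 0 else 1 - s): 1.0} for s in STATES for a in ACTIONS}

def policy_cost(policy, sweeps=500):
    """Expected discounted cumulative cost of a policy, for each start state."""
    J = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        J = {s: COST[s] + GAMMA * sum(p * J[s2]
                                      for s2, p in P[(s, policy[s])].items())
             for s in STATES}
    return J

# Enumerate every deterministic policy and keep the cheapest one.
policies = [dict(zip(STATES, acts)) for acts in product(ACTIONS, repeat=len(STATES))]
best = min(policies, key=lambda pi: sum(policy_cost(pi).values()))
print(best)  # {0: 0, 1: 1}: stay in the cheap state, leave the expensive one
```

Exhaustive enumeration is only feasible for tiny problems; in practice the cost of each policy has to be estimated from interaction.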
ANNs are frequently used in reinforcement learning as part of the overall algorithm.
Tasks that fall within the paradigm of reinforcement learning are control problems, games and other sequential decision making tasks.
See also: dynamic programming, stochastic control
Learning algorithms
Training a neural network model essentially means selecting one model from the set of allowed models (or, in a Bayesian framework, determining a distribution over the set of allowed models) that minimises the cost criterion.
There are numerous algorithms available for training neural network models; most of them can be viewed as a straightforward application of optimization theory and statistical estimation.
Most of the algorithms used in training artificial neural networks employ some form of gradient descent.
This is done by simply taking the derivative of the cost function with respect to the network parameters and then changing those parameters in a gradient-related direction.
Evolutionary methods, simulated annealing, expectation-maximization and non-parametric methods are among other commonly used methods for training neural networks.
See also machine learning.
Temporal perceptual learning relies on finding temporal relationships in sensory signal streams.
In an environment, statistically salient temporal correlations can be found by monitoring the arrival times of sensory signals.
This is done by the perceptual network.
Employing artificial neural networks
Perhaps the greatest advantage of ANNs is their ability to be used as an arbitrary function approximation mechanism which 'learns' from observed data.
However, using them is not so straightforward and a relatively good understanding of the underlying theory is essential.
Choice of model: This will depend on the data representation and the application.
Overly complex models tend to lead to problems with learning.
Learning algorithm: There are numerous tradeoffs between learning algorithms.
Almost any algorithm will work well with the correct hyperparameters for training on a particular fixed dataset.
However, selecting and tuning an algorithm for training on unseen data requires a significant amount of experimentation.
Robustness: If the model, cost function and learning algorithm are selected appropriately the resulting ANN can be extremely robust.
With the correct implementation ANNs can be used naturally in online learning and large dataset applications.
Their simple implementation and the existence of mostly local dependencies exhibited in the structure allow for fast, parallel implementations in hardware.
Applications
The utility of artificial neural network models lies in the fact that they can be used to infer a function from observations.
This is particularly useful in applications where the complexity of the data or task makes the design of such a function by hand impractical.
Real life applications
The tasks to which artificial neural networks are applied tend to fall within the following broad categories:
Function approximation, or regression analysis, including time series prediction and modeling.
Classification, including pattern and sequence recognition, novelty detection and sequential decision making.
Data processing, including filtering, clustering, blind source separation and compression.
Application areas include system identification and control (vehicle control, process control), game-playing and decision making (backgammon, chess, racing), pattern recognition (radar systems, face identification, object recognition and more), sequence recognition (gesture, speech, handwritten text recognition), medical diagnosis, financial applications (automated trading systems), data mining (or knowledge discovery in databases, "KDD"), visualization and e-mail spam filtering.
Neural network software
Neural network software is used to simulate, research, develop and apply artificial neural networks, biological neural networks and in some cases a wider array of adaptive systems.
See also logistic regression.
Types of neural networks
Feedforward neural network
The feedforward neural network was the first and arguably simplest type of artificial neural network devised.
In this network, the information moves in only one direction, forward, from the input nodes, through the hidden nodes (if any) and to the output nodes.
There are no cycles or loops in the network.
Radial basis function (RBF) network
Radial Basis Functions are powerful techniques for interpolation in multidimensional space.
An RBF is a function which has a distance criterion with respect to a centre built into it.
Radial basis functions have been applied in the area of neural networks where they may be used as a replacement for the sigmoidal hidden layer transfer characteristic in Multi-Layer Perceptrons.
RBF networks have two layers of processing: In the first, input is mapped onto each RBF in the 'hidden' layer.
The RBF chosen is usually a Gaussian.
In regression problems the output layer is then a linear combination of hidden layer values representing mean predicted output.
The interpretation of this output layer value is the same as a regression model in statistics.
In classification problems the output layer is typically a sigmoid function of a linear combination of hidden layer values, representing a posterior probability.
Performance in both cases is often improved by shrinkage techniques, known as ridge regression in classical statistics and known to correspond to a prior belief in small parameter values (and therefore smooth output functions) in a Bayesian framework.
RBF networks have the advantage of not suffering from local minima in the same way as Multi-Layer Perceptrons.
This is because the only parameters that are adjusted in the learning process are the linear mapping from hidden layer to output layer.
Linearity ensures that the error surface is quadratic and therefore has a single easily found minimum.
In regression problems this can be found in one matrix operation.
In classification problems the fixed non-linearity introduced by the sigmoid output function is most efficiently dealt with using iteratively re-weighted least squares.
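For the regression case, the single linear solve can be sketched as follows. The sketch assumes one-dimensional inputs, Gaussian basis functions with fixed hypothetical centres and width, and a small ridge term standing in for the shrinkage mentioned above; the helper names are illustrative only.

```python
import math

def gaussian_rbf(x, centre, width=1.0):
    return math.exp(-((x - centre) ** 2) / (2.0 * width ** 2))

def solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    w = [0.0] * n
    for r in reversed(range(n)):
        w[r] = (M[r][n] - sum(M[r][c] * w[c] for c in range(r + 1, n))) / M[r][r]
    return w

def fit_rbf(xs, ys, centres, ridge=1e-6):
    """Output weights from one linear solve: (H^T H + ridge * I) w = H^T y."""
    H = [[gaussian_rbf(x, c) for c in centres] for x in xs]
    k = len(centres)
    A = [[sum(row[i] * row[j] for row in H) + (ridge if i == j else 0.0)
          for j in range(k)] for i in range(k)]
    b = [sum(row[i] * y for row, y in zip(H, ys)) for i in range(k)]
    return solve(A, b)

def predict(x, centres, w):
    return sum(wi * gaussian_rbf(x, c) for wi, c in zip(w, centres))

xs = [i * 0.5 for i in range(13)]        # inputs 0.0 .. 6.0
ys = [math.sin(x) for x in xs]           # hypothetical regression target
centres = [0.0, 1.5, 3.0, 4.5, 6.0]      # hypothetical, evenly spaced
w = fit_rbf(xs, ys, centres)
print([round(predict(x, centres, w) - y, 3) for x, y in zip(xs, ys)])  # residuals
```

Only the output weights `w` are fitted; the centres and widths stay fixed, which is why the problem remains linear.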
RBF networks have the disadvantage of requiring good coverage of the input space by radial basis functions.
RBF centres are determined with reference to the distribution of the input data, but without reference to the prediction task.
As a result, representational resources may be wasted on areas of the input space that are irrelevant to the learning task.
A common solution is to associate each data point with its own centre, although this can make the linear system to be solved in the final layer rather large, and requires shrinkage techniques to avoid overfitting.
Associating each input datum with an RBF leads naturally to kernel methods such as Support Vector Machines and Gaussian Processes (the RBF is the kernel function).
All three approaches use a non-linear kernel function to project the input data into a space where the learning problem can be solved using a linear model.
Like Gaussian Processes, and unlike SVMs, RBF networks are typically trained in a Maximum Likelihood framework by maximizing the probability (minimizing the error) of the data under the model.
SVMs take a different approach to avoiding overfitting by maximizing instead a margin.
RBF networks are outperformed in most classification applications by SVMs.
In regression applications they can be competitive when the dimensionality of the input space is relatively small.
Kohonen self-organizing network
The self-organizing map (SOM) invented by Teuvo Kohonen uses a form of unsupervised learning.
A set of artificial neurons learn to map points in an input space to coordinates in an output space.
The input space can have different dimensions and topology from the output space, and the SOM will attempt to preserve these.
Recurrent network
Contrary to feedforward networks, recurrent neural networks (RNs) are models with bi-directional data flow.
While a feedforward network propagates data linearly from input to output, RNs also propagate data from later processing stages to earlier stages.
Simple recurrent network
A simple recurrent network (SRN) is a variation on the Multi-Layer Perceptron, sometimes called an "Elman network" due to its invention by Jeff Elman.
A three-layer network is used, with the addition of a set of "context units" in the input layer.
There are connections from the middle (hidden) layer to these context units fixed with a weight of one.
At each time step, the input is propagated in a standard feed-forward fashion, and then a learning rule (usually back-propagation) is applied.
The fixed back connections result in the context units always maintaining a copy of the previous values of the hidden units (since they propagate over the connections before the learning rule is applied).
Thus the network can maintain a sort of state, allowing it to perform such tasks as sequence-prediction that are beyond the power of a standard Multi-Layer Perceptron.
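The role of the context units can be sketched with a forward step only; the learning rule is omitted. The weights, the tanh hidden units and the linear output are assumptions for this example, and `srn_step` is an illustrative name.

```python
import math

def srn_step(x, context, W_in, W_ctx, W_out):
    """One time step: hidden units see the current input and the context units."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(W_in[j], x)) +
                        sum(w * cj for w, cj in zip(W_ctx[j], context)))
              for j in range(len(W_in))]
    output = [sum(w * hj for w, hj in zip(row, hidden)) for row in W_out]
    new_context = hidden[:]   # fixed weight-one copy of the hidden layer
    return output, new_context

# Hypothetical weights: 1 input, 2 hidden units, 2 context units, 1 output.
W_in = [[1.0], [-1.0]]
W_ctx = [[0.5, 0.0], [0.0, 0.5]]
W_out = [[1.0, 0.5]]

context = [0.0, 0.0]
out1, context = srn_step([1.0], context, W_in, W_ctx, W_out)
out2, context = srn_step([1.0], context, W_in, W_ctx, W_out)
print(out1, out2)  # same input, different outputs: the context carries state
```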
In a fully recurrent network, every neuron receives inputs from every other neuron in the network.
These networks are not arranged in layers.
Usually only a subset of the neurons receive external inputs in addition to the inputs from all the other neurons, and another disjoint subset of neurons report their output externally as well as sending it to all the neurons.
These distinctive inputs and outputs perform the function of the input and output layers of a feed-forward or simple recurrent network, and also join all the other neurons in the recurrent processing.
Hopfield network
The Hopfield network is a recurrent neural network in which all connections are symmetric.
Invented by John Hopfield in 1982, this network guarantees that its dynamics will converge.
If the connections are trained using Hebbian learning then the Hopfield network can perform as robust content-addressable (or associative) memory, resistant to connection alteration.
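A minimal sketch of content-addressable recall, assuming units with values +1 and -1, the simplest Hebbian rule (each weight is the sum of products of the two units' values over the stored patterns) and asynchronous sign updates. The stored pattern and function names are hypothetical.

```python
def train_hebbian(patterns):
    """Symmetric Hebbian weights w_ij = sum_p p_i * p_j, with no self-connections."""
    n = len(patterns[0])
    return [[0 if i == j else sum(p[i] * p[j] for p in patterns)
             for j in range(n)] for i in range(n)]

def recall(W, state, sweeps=5):
    """Asynchronous updates: each unit takes the sign of its weighted input."""
    state = list(state)
    for _ in range(sweeps):
        for i in range(len(state)):
            total = sum(W[i][j] * state[j] for j in range(len(state)))
            state[i] = 1 if total >= 0 else -1
    return state

stored = [1, -1, 1, -1, 1, -1, 1, -1]
W = train_hebbian([stored])
noisy = [-1, -1, 1, -1, 1, -1, 1, 1]    # first and last units flipped
print(recall(W, noisy) == stored)       # True
```

Presenting a corrupted version of a stored pattern is enough to retrieve the whole pattern, which is what makes the memory content-addressable.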
Echo state network
The echo state network (ESN) is a recurrent neural network with a sparsely connected random hidden layer.
The weights of output neurons are the only part of the network that can change and be learned.
ESNs are good at (re)producing temporal patterns.
Long short term memory network
Long short-term memory (LSTM) is an artificial neural network structure that, unlike traditional RNNs, does not suffer from the problem of vanishing gradients.
It can therefore use long delays and can handle signals that have a mix of low and high frequency components.
Stochastic neural networks
A stochastic neural network differs from a typical neural network because it introduces random variations into the network.
In a probabilistic view of neural networks, such random variations can be viewed as a form of statistical sampling, such as Monte Carlo sampling.
Boltzmann machine
The Boltzmann machine can be thought of as a noisy Hopfield network.
Invented by Geoff Hinton and Terry Sejnowski in 1985, the Boltzmann machine is important because it is one of the first neural networks to demonstrate learning of latent variables (hidden units).
Boltzmann machine learning was at first slow to simulate, but the contrastive divergence algorithm of Geoff Hinton (circa 2000) allows models such as Boltzmann machines and products of experts to be trained much faster.
Modular neural networks
Biological studies showed that the human brain functions not as a single massive network, but as a collection of small networks.
This realisation gave birth to the concept of modular neural networks, in which several small networks cooperate or compete to solve problems.
Committee of machines
A committee of machines (CoM) is a collection of different neural networks that together "vote" on a given example.
This generally gives a much better result compared to other neural network models.
In fact in many cases, starting with the same architecture and training but using different initial random weights gives vastly different networks.
A CoM tends to stabilize the result.
The CoM is similar to the general machine learning bagging method, except that the necessary variety of machines in the committee is obtained by training from different random starting weights rather than training on different randomly selected subsets of the training data.
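The voting step itself is simple and can be sketched as below. The committee members here are hypothetical stand-ins (threshold classifiers with slightly different decision boundaries) for networks trained from different random starting weights.

```python
from collections import Counter

def committee_vote(members, x):
    """Return the label predicted by the largest number of committee members."""
    votes = Counter(member(x) for member in members)
    return votes.most_common(1)[0][0]

# Hypothetical members: each classifies x as 1 when it exceeds its own threshold.
members = [lambda x, t=t: int(x > t) for t in (0.4, 0.5, 0.7)]

print(committee_vote(members, 0.6))   # 1: two of three members vote for class 1
print(committee_vote(members, 0.45))  # 0: two of three members vote for class 0
```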
Associative neural network (ASNN)
The ASNN is an extension of the committee of machines that goes beyond a simple/weighted average of different models.
ASNN represents a combination of an ensemble of feed-forward neural networks and the k-nearest neighbor technique (kNN).
It uses the correlation between ensemble responses as a measure of distance among the analyzed cases for the kNN.
This corrects the bias of the neural network ensemble.
An associative neural network has a memory that can coincide with the training set.
If new data becomes available, the network instantly improves its predictive ability and provides data approximation (it self-learns the data) without a need to retrain the ensemble.
Another important feature of ASNN is the possibility to interpret neural network results by analysis of correlations between data cases in the space of models.
The method is demonstrated at www.vcclab.org, where you can either use it online or download it.
Other types of networks
These special networks do not fit in any of the previous categories.
Holographic associative memory
Holographic associative memory represents a family of analog, correlation-based, associative, stimulus-response memories, where information is mapped onto the phase orientation of complex numbers.
Instantaneously trained networks
Instantaneously trained neural networks (ITNNs) were inspired by the phenomenon of short-term learning that seems to occur instantaneously.
In these networks the weights of the hidden and the output layers are mapped directly from the training vector data.
Ordinarily, they work on binary data, but versions for continuous data that require small additional processing are also available.
Spiking neural networks
Spiking neural networks (SNNs) are models which explicitly take into account the timing of inputs.
The network input and output are usually represented as series of spikes (delta function or more complex shapes).
SNNs have an advantage of being able to process information in the time domain (signals that vary over time).
They are often implemented as recurrent networks.
SNNs are also a form of pulse computer.
Networks of spiking neurons, and the temporal correlations of neural assemblies in such networks, have been used to model figure/ground separation and region linking in the visual system (see e.g. Reitboeck et al. in Haken and Stadler: Synergetics of the Brain. Berlin, 1989).
Gerstner and Kistler have a freely available online textbook on Spiking Neuron Models.
Spiking neural networks with axonal conduction delays exhibit polychronization, and hence could have a potentially unlimited memory capacity.
In June 2005 IBM announced construction of a Blue Gene supercomputer dedicated to the simulation of a large recurrent spiking neural network (domino.research.ibm.com/comm/pr.nsf/pages/news.20050606_CognitiveIntelligence.html).
Dynamic neural networks
Dynamic neural networks not only deal with nonlinear multivariate behaviour, but also include (learning of) time-dependent behaviour such as various transient phenomena and delay effects.
Cascading neural networks
Cascade-Correlation is an architecture and supervised learning algorithm developed by Scott Fahlman and Christian Lebiere.
Instead of just adjusting the weights in a network of fixed topology, Cascade-Correlation begins with a minimal network, then automatically trains and adds new hidden units one by one, creating a multi-layer structure.
Once a new hidden unit has been added to the network, its input-side weights are frozen.
This unit then becomes a permanent feature-detector in the network, available for producing outputs or for creating other, more complex feature detectors.
The Cascade-Correlation architecture has several advantages over existing algorithms: it learns very quickly, the network determines its own size and topology, it retains the structures it has built even if the training set changes, and it requires no back-propagation of error signals through the connections of the network.
See: Cascade correlation algorithm.
Neuro-fuzzy networks
A neuro-fuzzy network is a fuzzy inference system in the body of an artificial neural network.
Depending on the FIS type, there are several layers that simulate the processes involved in a fuzzy inference like fuzzification, inference, aggregation and defuzzification.
Embedding an FIS in a general structure of an ANN has the benefit of using available ANN training methods to find the parameters of a fuzzy system.
Holosemantic neural networks
The holosemantic neural network invented by Manfred Hoffleisch uses a kind of genetic algorithm to build a multidimensional structure.
It takes into account the timing of inputs.
Compositional pattern-producing networks
Compositional pattern-producing networks (CPPNs) are a variation of ANNs which differ in their set of activation functions and how they are applied.
While typical ANNs often contain only sigmoid functions (and sometimes Gaussian functions), CPPNs can include both types of functions and many others.
Furthermore, unlike typical ANNs, CPPNs are applied across the entire space of possible inputs so that they can represent a complete image.
Since they are compositions of functions, CPPNs in effect encode images at infinite resolution and can be sampled for a particular display at whatever resolution is optimal.
Theoretical properties
Computational power
The multi-layer perceptron (MLP) is a universal function approximator, as proven by the Cybenko theorem.
However, the proof is not constructive regarding the number of neurons required or the settings of the weights.
Work by Hava T. Siegelmann and Eduardo D. Sontag has provided a proof that a specific recurrent architecture with rational valued weights (as opposed to the commonly used floating point approximations) has the full power of a Universal Turing Machine.
They have further shown that the use of irrational values for weights results in a machine with trans-Turing power.
Capacity
Artificial neural network models have a property called 'capacity', which roughly corresponds to their ability to model any given function.
It is related to the amount of information that can be stored in the network and to the notion of complexity.
Convergence
Nothing can be said in general about convergence since it depends on a number of factors.
Firstly, there may exist many local minima.
This depends on the cost function and the model.
Secondly, the optimization method used might not be guaranteed to converge when far away from a local minimum.
Thirdly, for a very large amount of data or parameters, some methods become impractical.
In general, it has been found that theoretical guarantees regarding convergence are not always a very reliable guide to practical application.
Generalisation and statistics
In applications where the goal is to create a system that generalises well to unseen examples, the problem of overtraining has emerged.
This arises in overcomplex or overspecified systems when the capacity of the network significantly exceeds the needed free parameters.
There are two schools of thought for avoiding this problem: The first is to use cross-validation and similar techniques to check for the presence of overtraining and optimally select hyperparameters such as to minimise the generalisation error.
The second is to use some form of regularisation.
This is a concept that emerges naturally in a probabilistic (Bayesian) framework, where the regularisation can be performed by putting a larger prior probability over simpler models; but also in statistical learning theory, where the goal is to minimise over two quantities: the 'empirical risk' and the 'structural risk', which roughly correspond to the error over the training set and the predicted error in unseen data due to overfitting.
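One common way to write this trade-off, using symbols introduced here purely for illustration, adds a complexity penalty $\Omega$ weighted by a coefficient $\lambda \ge 0$ to the empirical risk over $N$ training pairs $(x_i, y_i)$:

```latex
\hat{w} = \arg\min_{w} \; \frac{1}{N} \sum_{i=1}^{N} L\big(y_i, f(x_i; w)\big) + \lambda \, \Omega(w)
```

With a squared-error loss and $\Omega(w) = \lVert w \rVert^2$ this is weight decay, which in the Bayesian reading corresponds to a zero-mean Gaussian prior on the weights.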
Supervised neural networks that use an MSE cost function can use formal statistical methods to determine the confidence of the trained model.
The MSE on a validation set can be used as an estimate for variance.
This value can then be used to calculate the confidence interval of the output of the network, assuming a normal distribution.
A confidence analysis made this way is statistically valid as long as the output probability distribution stays the same and the network is not modified.
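A minimal sketch of this calculation is given below. The function name, the residuals, and the choice of a 95% level (normal quantile 1.96) are assumptions made for the example.

```python
import math

def prediction_interval(prediction, validation_errors, z=1.96):
    """Approximate interval around one network output.

    validation_errors: residuals (target - output) on a held-out validation set.
    z: normal quantile; 1.96 corresponds to roughly 95% coverage.
    """
    mse = sum(e * e for e in validation_errors) / len(validation_errors)
    sigma = math.sqrt(mse)  # validation MSE used as the variance estimate
    return prediction - z * sigma, prediction + z * sigma

# Hypothetical residuals, invented for illustration.
low, high = prediction_interval(10.0, [0.5, -0.5, 0.5, -0.5])
```

Here the validation MSE is 0.25, so the estimated standard deviation is 0.5 and the interval is the prediction plus or minus 0.98.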
By assigning a softmax activation function on the output layer of the neural network (or a softmax component in a component-based neural network) for categorical target variables, the outputs can be interpreted as posterior probabilities.
This is very useful in classification as it gives a certainty measure on classifications.
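The softmax activation function is $y_i = \frac{e^{x_i}}{\sum_{j=1}^{c} e^{x_j}}$, where $x_i$ is the input to output unit $i$ and $c$ is the number of classes.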
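A minimal Python sketch of the softmax function follows. Subtracting the maximum before exponentiating is a standard numerical-stability measure added here; it does not change the result.

```python
import math

def softmax(logits):
    """Map raw output-layer activations to positive values that sum to 1."""
    m = max(logits)                           # shift for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
```

The largest input receives the largest probability, and the outputs can be read as class probabilities under the interpretation described above.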
Dynamic properties
Various techniques originally developed for studying disordered magnetic systems (i.e. the spin glass) have been successfully applied to simple neural network architectures, such as the Hopfield network.
Influential work by E. Gardner and B. Derrida has revealed many interesting properties about perceptrons with real-valued synaptic weights, while later work by W. Krauth and M. Mezard has extended these principles to binary-valued synapses.
Association for Computational Linguistics
The Association for Computational Linguistics (ACL) is the international scientific and professional society for people working on problems involving natural language and computation.
An annual meeting is held each summer in locations where significant computational linguistics research is carried out.
It was founded in 1962, originally named the Association for Machine Translation and Computational Linguistics (AMTCL).
It became the ACL in 1968.
The ACL has European and North American chapters, the European Chapter of the Association for Computational Linguistics (EACL) and the North American Chapter of the Association for Computational Linguistics (NAACL).
The ACL journal, Computational Linguistics, continues to be the primary forum for research on computational linguistics and natural language processing.
Since 1988, the journal has been published for the ACL by MIT Press.
The ACL book series, Studies in Natural Language Processing, is published by Cambridge University Press.
Special Interest Groups
ACL has a large number of Special Interest Groups (SIGs), focusing on specific areas of natural language processing.
Some current SIGs within ACL are:
Linguistic data and corpus-based approaches: SIGDAT
Dialogue Processing: SIGDIAL
Natural Language Generation: SIGGEN
Lexicon: SIGLEX
Mathematics of Language: SIGMOL
Computational Morphology and Phonology: SIGMORPHON
Computational Semantics: SIGSEM
Babel Fish (website)
Babel Fish is a web-based application on Yahoo! that machine translates text or web pages from one of several languages into another.
Developed by AltaVista, the application is named after the fictional animal used for instantaneous language translation in Douglas Adams's series The Hitchhiker's Guide to the Galaxy.
In turn the fish is a reference to the biblical account of the city of Babel and the various languages said to have arisen there.
The translation technology for Babel Fish is provided by SYSTRAN, whose technology also powers a number of other sites and portals.
It translates among English, Simplified Chinese, Traditional Chinese, Dutch, French, German, Greek, Italian, Japanese, Korean, Portuguese, Russian, and Spanish.
The service makes no claim to produce a perfect translation.
A number of humour sites have sprung up that use the Babel Fish service to translate back and forth between one or more languages (a so-called round-trip translation).
After a long existence at babelfish.altavista.com, the site was moved on May 9, 2008 to babelfish.yahoo.com.
Bioinformatics
Bioinformatics and computational biology involve the use of techniques including applied mathematics, informatics, statistics, computer science, artificial intelligence, chemistry, and biochemistry to solve biological problems usually on the molecular level.
The core principle of these techniques is using computing resources in order to solve problems on scales of magnitude far too great for human discernment.
Research in computational biology often overlaps with systems biology.
Major research efforts in the field include sequence alignment, gene finding, genome assembly, protein structure alignment, protein structure prediction, prediction of gene expression and protein-protein interactions, and the modeling of evolution.
Introduction
The terms bioinformatics and computational biology are often used interchangeably.
However bioinformatics more properly refers to the creation and advancement of algorithms, computational and statistical techniques, and theory to solve formal and practical problems arising from the management and analysis of biological data.
Computational biology, on the other hand, refers to hypothesis-driven investigation of a specific biological problem using computers, carried out with experimental or simulated data, with the primary goal of discovery and the advancement of biological knowledge.
Put more simply, bioinformatics is concerned with the information while computational biology is concerned with the hypotheses.
A similar distinction is made by National Institutes of Health in their working definitions of Bioinformatics and Computational Biology, where it is further emphasized that there is a tight coupling of developments and knowledge between the more hypothesis-driven research in computational biology and technique-driven research in bioinformatics.
Bioinformatics is also often specified as an applied subfield of the more general discipline of Biomedical informatics.
A common thread in projects in bioinformatics and computational biology is the use of mathematical tools to extract useful information from data produced by high-throughput biological techniques such as genome sequencing.
A representative problem in bioinformatics is the assembly of high-quality genome sequences from fragmentary "shotgun" DNA sequencing.
Other common problems include the study of gene regulation to perform expression profiling using data from microarrays or mass spectrometry.
Major research areas
Sequence analysis
Since the Phage Φ-X174 was sequenced in 1977, the DNA sequences of hundreds of organisms have been decoded and stored in databases.
The information is analyzed to determine genes that encode polypeptides, as well as regulatory sequences.
A comparison of genes within a species or between different species can show similarities between protein functions, or relations between species (the use of molecular systematics to construct phylogenetic trees).
With the growing amount of data, it long ago became impractical to analyze DNA sequences manually.
Today, computer programs are used to search the genomes of thousands of organisms, containing billions of nucleotides.
These programs compensate for mutations (exchanged, deleted or inserted bases) in the DNA sequence, in order to identify sequences that are related, but not identical.
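The idea of tolerating exchanged, deleted or inserted bases can be sketched with a plain edit-distance computation. This is an illustrative stand-in with unit costs for every operation, which is an assumption of the example; practical alignment programs use biologically motivated scoring and faster heuristics.

```python
def edit_distance(a, b):
    """Minimum number of substitutions, deletions and insertions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # match (cost 0) or substitute (cost 1)
            ))
        prev = curr
    return prev[-1]

# Two sequences differing by one exchanged base are still recognised as close.
d = edit_distance("GATTACA", "GACTACA")
```

A small distance relative to the sequence length suggests that two sequences are related even though they are not identical.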
A variant of this sequence alignment is used in the sequencing process itself.
The so-called shotgun sequencing technique (which was used, for example, by The Institute for Genomic Research to sequence the first bacterial genome, Haemophilus influenzae) does not give a sequential list of nucleotides, but instead the sequences of thousands of small DNA fragments (each about 600-800 nucleotides long).
The ends of these fragments overlap and, when aligned in the right way, make up the complete genome.
Shotgun sequencing yields sequence data quickly, but the task of assembling the fragments can be quite complicated for larger genomes.
In the case of the Human Genome Project, it took several months of CPU time (on a circa-2000 vintage DEC Alpha computer) to assemble the fragments.
Shotgun sequencing is the method of choice for virtually all genomes sequenced today, and genome assembly algorithms are a critical area of bioinformatics research.
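The assembly step can be illustrated with a toy greedy procedure that repeatedly merges the two fragments with the longest suffix-prefix overlap. This is only a sketch under simplifying assumptions (error-free reads, a single strand, no repeats), and the function names and the tiny fragments are invented for the example; real assemblers are considerably more elaborate.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a equal to a prefix of b (0 if below min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(fragments, min_len=3):
    """Repeatedly merge the pair of fragments with the largest overlap."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(frags) > 1:
        best = (0, None, None)
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:
            break  # no remaining overlaps; leave the contigs separate
        merged = frags[i] + frags[j][k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (i, j)] + [merged]
    return frags

contigs = greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"])
```

The quadratic search over all fragment pairs in every round hints at why assembly becomes expensive for large genomes.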
Another aspect of bioinformatics in sequence analysis is the automatic search for genes and regulatory sequences within a genome.
Not all of the nucleotides within a genome are genes.
Within the genome of higher organisms, large parts of the DNA do not serve any obvious purpose.
This so-called junk DNA may, however, contain unrecognized functional elements.
Bioinformatics helps to bridge the gap between genome and proteome projects--for example, in the use of DNA sequences for protein identification.
See also: sequence analysis, sequence profiling tool, sequence motif.
Genome annotation
In the context of genomics, annotation is the process of marking the genes and other biological features in a DNA sequence.
The first genome annotation software system was designed in 1995 by Dr. Owen White, who was part of the team that sequenced and analyzed the first genome of a free-living organism to be decoded, the bacterium Haemophilus influenzae.
Dr. White built a software system to find the genes (places in the DNA sequence that encode a protein), the transfer RNA, and other features, and to make initial assignments of function to those genes.
Most current genome annotation systems work similarly, but the programs available for analysis of genomic DNA are constantly changing and improving.
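A heavily simplified version of the gene-finding step is a scan for open reading frames: stretches that begin with a start codon and run in frame to a stop codon. The sketch below is an assumption-laden illustration (forward strand only, ATG as the sole start codon, no statistical model), not a description of any particular annotation system.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=2):
    """Return (start, end, sequence) for ATG...stop runs on the forward strand.

    end is exclusive and includes the stop codon; min_codons counts the
    codons before the stop, including the ATG.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, dna[start:i + 3]))
                start = None
    return orfs

orfs = find_orfs("CCATGAAATTTTAGGG")
```

In this toy sequence the only reading frame containing a start codon followed by an in-frame stop begins at position 2.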
Computational evolutionary biology
Evolutionary biology is the study of the origin and descent of species, as well as their change over time.
Informatics has assisted evolutionary biologists in several key ways; it has enabled researchers to:
trace the evolution of a large number of organisms by measuring changes in their DNA, rather than through physical taxonomy or physiological observations alone,
more recently, compare entire genomes, which permits the study of more complex evolutionary events, such as gene duplication, lateral gene transfer, and the prediction of factors important in bacterial speciation,
build complex computational models of populations to predict the outcome of the system over time
track and share information on an increasingly large number of species and organisms
Future work endeavours to reconstruct the now more complex tree of life.
The area of research within computer science that uses genetic algorithms is sometimes confused with computational evolutionary biology, but the two areas are unrelated.
Measuring biodiversity
Biodiversity of an ecosystem might be defined as the total genomic complement of a particular environment, from all of the species present, whether it is a biofilm in an abandoned mine, a drop of sea water, a scoop of soil, or the entire biosphere of the planet Earth.
Databases are used to collect the species names, descriptions, distributions, genetic information, status and size of populations, habitat needs, and how each organism interacts with other species.
Specialized software programs are used to find, visualize, and analyze the information, and most importantly, communicate it to other people.
Computer simulations model such things as population dynamics, or calculate the cumulative genetic health of a breeding pool (in agriculture) or endangered population (in conservation).
One very exciting potential of this field is that entire DNA sequences, or genomes of endangered species can be preserved, allowing the results of Nature's genetic experiment to be remembered in silico, and possibly reused in the future, even if that species is eventually lost.
Important projects: Species 2000 project; uBio Project.
Analysis of gene expression
The expression of many genes can be determined by measuring mRNA levels with multiple techniques including microarrays, expressed cDNA sequence tag (EST) sequencing, serial analysis of gene expression (SAGE) tag sequencing, massively parallel signature sequencing (MPSS), or various applications of multiplexed in-situ hybridization.
All of these techniques are extremely noise-prone and/or subject to bias in the biological measurement, and a major research area in computational biology involves developing statistical tools to separate signal from noise in high-throughput gene expression studies.
Such studies are often used to determine the genes implicated in a disorder: one might compare microarray data from cancerous epithelial cells to data from non-cancerous cells to determine the transcripts that are up-regulated and down-regulated in a particular population of cancer cells.
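The comparison described above is often summarised as a log fold change per transcript. The sketch below assumes two dictionaries of expression levels and a simple threshold; the gene names, values, pseudocount and threshold are all invented for illustration, and a real analysis would add normalisation and statistical testing over replicates.

```python
import math

def log2_fold_changes(tumour, normal, pseudocount=1.0):
    """log2(tumour / normal) per transcript; the pseudocount avoids division by zero."""
    return {
        gene: math.log2((tumour[gene] + pseudocount) / (normal[gene] + pseudocount))
        for gene in tumour.keys() & normal.keys()
    }

def classify(changes, threshold=1.0):
    """Split transcripts into up- and down-regulated by |log2 fold change|."""
    up = sorted(g for g, c in changes.items() if c >= threshold)
    down = sorted(g for g, c in changes.items() if c <= -threshold)
    return up, down

# Hypothetical expression levels in arbitrary units.
tumour = {"geneA": 799.0, "geneB": 99.0, "geneC": 24.0}
normal = {"geneA": 99.0, "geneB": 99.0, "geneC": 199.0}
changes = log2_fold_changes(tumour, normal)
up, down = classify(changes)
```

With these numbers `geneA` is up-regulated eightfold, `geneC` is down-regulated eightfold, and `geneB` is unchanged.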
Analysis of regulation
Regulation is the complex orchestration of events starting with an extracellular signal such as a hormone and leading to an increase or decrease in the activity of one or more proteins.
Bioinformatics techniques have been applied to explore various steps in this process.
For example, promoter analysis involves the identification and study of sequence motifs in the DNA surrounding the coding region of a gene.
These motifs influence the extent to which that region is transcribed into mRNA.
Expression data can be used to infer gene regulation: one might compare microarray data from a wide variety of states of an organism to form hypotheses about the genes involved in each state.
In a single-cell organism, one might compare stages of the cell cycle, along with various stress conditions (heat shock, starvation, etc.).
One can then apply clustering algorithms to that expression data to determine which genes are co-expressed.
For example, the upstream regions (promoters) of co-expressed genes can be searched for over-represented regulatory elements.
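As a simple stand-in for a full clustering algorithm, co-expression can be sketched by thresholding the Pearson correlation between expression profiles measured across the same conditions. The profiles, gene names and the 0.9 threshold below are hypothetical, and the sketch assumes no profile is constant (which would make the correlation undefined).

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def coexpressed_pairs(profiles, threshold=0.9):
    """Gene pairs whose profiles correlate at or above the threshold."""
    genes = sorted(profiles)
    return [
        (g, h)
        for i, g in enumerate(genes)
        for h in genes[i + 1:]
        if pearson(profiles[g], profiles[h]) >= threshold
    ]

# Hypothetical expression levels across four conditions.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.5],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}
pairs = coexpressed_pairs(profiles)
```

Genes that end up paired or grouped this way are candidates for sharing regulatory elements in their upstream regions.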
Analysis of protein expression
Protein microarrays and high throughput (HT) mass spectrometry (MS) can provide a snapshot of the proteins present in a biological sample.
Bioinformatics is heavily involved in making sense of protein microarray and HT MS data. The former approach faces problems similar to those of microarrays targeted at mRNA; the latter involves matching large amounts of mass data against predicted masses from protein sequence databases, and the complicated statistical analysis of samples in which multiple but incomplete peptides from each protein are detected.
Analysis of mutations in cancer
In cancer, the genomes of affected cells are rearranged in complex or even unpredictable ways.
Massive sequencing efforts are used to identify previously unknown point mutations in a variety of genes in cancer.
Bioinformaticians continue to produce specialized automated systems to manage the sheer volume of sequence data produced, and they create new algorithms and software to compare the sequencing results to the growing collection of human genome sequences and germline polymorphisms.
New physical detection technologies are employed, such as oligonucleotide microarrays to identify chromosomal gains and losses (called comparative genomic hybridization), and single nucleotide polymorphism arrays to detect known point mutations.
These detection methods simultaneously measure several hundred thousand sites throughout the genome, and when used in high-throughput to measure thousands of samples, generate terabytes of data per experiment.
Again the massive amounts and new types of data generate new opportunities for bioinformaticians.
The data is often found to contain considerable variability, or noise, and thus Hidden Markov model and change-point analysis methods are being developed to infer real copy number changes.
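The simplest form of change-point analysis looks for a single position that best splits a noisy signal into two segments with different means. The sketch below uses a least-squares criterion and an invented signal; it is a deliberately minimal illustration and not a hidden Markov model or any published copy-number method.

```python
def best_change_point(values):
    """Index k (1 <= k < n) splitting values into two segments with minimal squared error."""
    def sse(segment):
        mean = sum(segment) / len(segment)
        return sum((v - mean) ** 2 for v in segment)
    return min(range(1, len(values)),
               key=lambda k: sse(values[:k]) + sse(values[k:]))

# Hypothetical noisy log-ratio signal with a copy-number gain starting at index 5.
signal = [0.1, -0.1, 0.0, 0.05, -0.05, 1.0, 0.9, 1.1, 1.05, 0.95]
k = best_change_point(signal)
```

Practical methods extend this idea to multiple change points and to explicit noise models.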
Another type of data that requires novel informatics development is the analysis of lesions found to be recurrent among many tumors.
Prediction of protein structure
Protein structure prediction is another important application of bioinformatics.
The amino acid sequence of a protein, the so-called primary structure, can be easily determined from the sequence on the gene that codes for it.
In the vast majority of cases, this primary structure uniquely determines a structure in its native environment.
(There are exceptions, such as the prion associated with bovine spongiform encephalopathy, also known as mad cow disease.)
Knowledge of this structure is vital in understanding the function of the protein.
For lack of better terms, structural information is usually classified as one of secondary, tertiary and quaternary structure.
A viable general solution to such predictions remains an open problem.
As of now, most efforts have been directed towards heuristics that work most of the time.
One of the key ideas in bioinformatics is the notion of homology.
In the genomic branch of bioinformatics, homology is used to predict the function of a gene: if the sequence of gene A, whose function is known, is homologous to the sequence of gene B, whose function is unknown, one could infer that B may share A's function.
In the structural branch of bioinformatics, homology is used to determine which parts of a protein are important in structure formation and interaction with other proteins.
In a technique called homology modeling, this information is used to predict the structure of a protein once the structure of a homologous protein is known.
This currently remains the only way to predict protein structures reliably.
One example of this is the homology between hemoglobin in humans and the hemoglobin found in legumes (leghemoglobin).
Both serve the same purpose of transporting oxygen in the organism.
Though both of these proteins have completely different amino acid sequences, their protein structures are virtually identical, which reflects their near identical purposes.
Other techniques for predicting protein structure include protein threading and de novo (from scratch) physics-based modeling.
See also: structural motif and structural domain.
Comparative genomics
The core of comparative genome analysis is the establishment of the correspondence between genes (orthology analysis) or other genomic features in different organisms.
It is these intergenomic maps that make it possible to trace the evolutionary processes responsible for the divergence of two genomes.
A multitude of evolutionary events acting at various organizational levels shape genome evolution.
At the lowest level, point mutations affect individual nucleotides.
At a higher level, large chromosomal segments undergo duplication, lateral transfer, inversion, transposition, deletion and insertion.
Ultimately, whole genomes are involved in processes of hybridization, polyploidization and endosymbiosis, often leading to rapid speciation.
The complexity of genome evolution poses many exciting challenges to developers of mathematical models and algorithms, who have recourse to a spectrum of algorithmic, statistical and mathematical techniques, ranging from exact, heuristic, fixed-parameter and approximation algorithms for problems based on parsimony models to Markov chain Monte Carlo algorithms for Bayesian analysis of problems based on probabilistic models.
Many of these studies are based on the homology detection and protein families computation.
Modeling biological systems
Systems biology involves the use of computer simulations of cellular subsystems (such as the networks of metabolites and enzymes which comprise metabolism, signal transduction pathways and gene regulatory networks) to both analyze and visualize the complex connections of these cellular processes.
Artificial life or virtual evolution attempts to understand evolutionary processes via the computer simulation of simple (artificial) life forms.
High-throughput image analysis
Computational technologies are used to accelerate or fully automate the processing, quantification and analysis of large amounts of high-information-content biomedical imagery.
Modern image analysis systems augment an observer's ability to make measurements from a large or complex set of images, by improving accuracy, objectivity, or speed.
A fully developed analysis system may completely replace the observer.
Although these systems are not unique to biomedical imagery, biomedical imaging is becoming more important for both diagnostics and research.
Some examples are:
high-throughput and high-fidelity quantification and sub-cellular localization (high-content screening, cytohistopathology)
morphometrics
clinical image analysis and visualization
determining the real-time air-flow patterns in breathing lungs of living animals
quantifying occlusion size in real-time imagery from the development of and recovery during arterial injury
making behavioral observations from extended video recordings of laboratory animals
infrared measurements for metabolic activity determination
Protein-protein docking
In the last two decades, tens of thousands of protein three-dimensional structures have been determined by X-ray crystallography and Protein nuclear magnetic resonance spectroscopy (protein NMR).
One central question for the biological scientist is whether it is practical to predict possible protein-protein interactions only based on these 3D shapes, without doing protein-protein interaction experiments.
A variety of methods have been developed to tackle the protein-protein docking problem, though much work remains to be done in this field.
Software and Tools
Software tools for bioinformatics range from simple command-line tools, to more complex graphical programs and standalone web-services.
The computational biology tool best-known among biologists is probably BLAST, an algorithm for determining the similarity of arbitrary sequences against other sequences, possibly from curated databases of protein or DNA sequences.
The NCBI provides a popular web-based implementation that searches their databases.
BLAST is one of a number of generally available programs for doing sequence alignment.
Web Services in Bioinformatics
SOAP and REST-based interfaces have been developed for a wide variety of bioinformatics applications allowing an application running on one computer in one part of the world to use algorithms, data and computing resources on servers in other parts of the world.
The main advantage lies in the end user not having to deal with software and database maintenance overheads.
Basic bioinformatics services are classified by the EBI into three categories: SSS (Sequence Search Services), MSA (Multiple Sequence Alignment) and BSA (Biological Sequence Analysis).
The availability of these service-oriented bioinformatics resources demonstrates the applicability of web-based bioinformatics solutions, which range from a collection of standalone tools with a common data format under a single, standalone or web-based interface, to integrative, distributed and extensible bioinformatics workflow management systems.
BLEU
BLEU (Bilingual Evaluation Understudy) is a method for evaluating the quality of text which has been translated from one natural language to another using machine translation.
BLEU was one of the first software metrics to report high correlation with human judgements of quality.
The metric is currently one of the most popular in the field.
The central idea behind the metric is that, "the closer a machine translation is to a professional human translation, the better it is".
The metric calculates scores for individual segments, generally sentences, and then averages these scores over the whole corpus in order to reach a final score.
It has been shown to correlate highly with human judgements of quality at the corpus level.
The quality of translation is indicated as a number between 0 and 1 and is measured as statistical closeness to a given set of good quality human reference translations.
Therefore, it does not directly take into account translation intelligibility or grammatical correctness.
The metric works by measuring the n-gram co-occurrence between a given translation and the set of reference translations and then taking the weighted geometric mean.
BLEU is specifically designed to approximate human judgement on a corpus level and performs badly if used to evaluate the quality of isolated sentences.
Algorithm
BLEU uses a modified form of precision to compare a candidate translation against multiple reference translations.
The metric modifies simple precision since machine translation systems have been known to generate more words than appear in a reference text.
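This is illustrated in the following example from Papineni et al. (2002):
Candidate: the the the the the the the
Reference 1: the cat is on the mat
Reference 2: there is a cat on the mat
In this example, the candidate text is given a unigram precision of $P = \frac{m}{w_t} = \frac{7}{7} = 1$, where $m$ is the number of words from the candidate that are found in a reference and $w_t$ is the total number of words in the candidate.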
Of the seven words in the candidate translation, all of them appear in the reference translations.
This presents a problem for a metric, as the candidate translation above is complete nonsense, retaining none of the content of either of the references.
The modification that BLEU makes is fairly straightforward.
For each word in the candidate translation, the algorithm takes the maximum total count in the reference translations.
Taking the example above, the word 'the' appears twice in reference 1, and once in reference 2.
The largest value is taken, in this case '2' as the "maximum reference count".
For each of the words in the candidate translation, the count of the word is compared against the maximum reference count, and the lowest value is taken.
In this case, the count of the word 'the' in the candidate translation is '7', while the maximum reference count for the word is '2'.
This "modified count" is then divided by the total number of words in the candidate translation.
In the above example, the modified unigram precision score would be $P = \frac{2}{7}$.
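The clipping procedure can be sketched in a few lines of Python. The function name and the whitespace tokenisation are assumptions of this example; the sentences are those of the Papineni et al. (2002) example, in which 'the' appears twice in reference 1 and once in reference 2.

```python
from collections import Counter

def modified_precision(candidate, references, n=1):
    """Clipped n-gram precision of one candidate against several references."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand_counts = ngrams(candidate)
    max_ref = Counter()  # maximum reference count per n-gram
    for ref in references:
        for gram, count in ngrams(ref).items():
            max_ref[gram] = max(max_ref[gram], count)

    # Each candidate count is clipped to its maximum reference count.
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

references = ["the cat is on the mat".split(), "there is a cat on the mat".split()]
p_nonsense = modified_precision("the the the the the the the".split(), references)
p_short = modified_precision("the cat".split(), references)
```

The nonsense candidate now scores 2/7 instead of 1, while the very short candidate "the cat" still scores 1, which is the weakness that the brevity penalty is meant to address.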
The above method is used to calculate scores for each $n$.
The value of $n$ which has the "highest correlation with monolingual human judgements" was found to be 4.
The unigram scores are found to account for the adequacy of the translation, in other words, how much information is retained in the translation.
The longer $n$-gram scores account for the fluency of the translation, or to what extent it reads like "good English".
The modification made to precision does not solve the problem of short translations.
Short translations can produce very high precision scores, even using modified precision.
An example of a candidate translation for the same references as above might be:
the cat
In this example, the modified unigram precision would be $\frac{1}{2} + \frac{1}{2} = \frac{2}{2}$,
as the word 'the' and the word 'cat' appear once each in the candidate, and the total number of words is two.
The modified bigram precision would be $\frac{1}{1}$, as the bigram "the cat" appears once in the candidate.
It has been pointed out that precision is usually twinned with recall to overcome this problem, as the unigram recall of this example would be $\frac{2}{6}$ or $\frac{2}{7}$.
The problem is that, as there are multiple reference translations, a bad translation could easily have an inflated recall, such as a translation consisting of all the words in each of the references.
In order to produce a score for the whole corpus, the modified precision scores for the segments are combined using the geometric mean, multiplied by a brevity penalty, whose purpose is to prevent very short candidates from receiving too high a score.
Let $r$ be the total length of the reference corpus, and $c$ the total length of the translation corpus.
If $c \leq r$, the brevity penalty applies and is defined to be $e^{(1 - r/c)}$.
(In the case of multiple reference sentences, $r$ is taken to be the sum of the lengths of the sentences whose lengths are closest to the lengths of the candidate sentences.
However, in the version of the metric used by NIST, the shortest reference sentence is used.)
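Putting the pieces together, the corpus score can be sketched as the brevity penalty times the geometric mean of the modified precisions. In this sketch `c` is the candidate corpus length and `r` the reference length; the function names and the use of uniform weights over the n-gram orders are assumptions of the example.

```python
import math

def brevity_penalty(c, r):
    """exp(1 - r/c) when the candidate is no longer than the reference, else 1."""
    if c == 0:
        return 0.0
    return 1.0 if c > r else math.exp(1.0 - r / c)

def bleu(precisions, c, r):
    """Brevity penalty times the geometric mean of the modified n-gram precisions."""
    if min(precisions) <= 0:
        return 0.0  # the geometric mean is zero if any precision is zero
    log_mean = sum(math.log(p) for p in precisions) / len(precisions)
    return brevity_penalty(c, r) * math.exp(log_mean)

# The two-word candidate "the cat" has perfect unigram and bigram precision,
# but is penalised against a six-word reference.
score = bleu([1.0, 1.0], c=2, r=6)
```

For this short candidate the penalty is exp(1 - 6/2) = exp(-2), so the perfect precisions no longer translate into a high score.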
Performance
BLEU has frequently been reported as correlating well with human judgement, and certainly remains a benchmark for any new evaluation metric to beat.
There are however a number of criticisms that have been voiced.
It has been noted that while in theory capable of evaluating any language, BLEU does not in the present form work on languages without word boundaries.
It has been argued that although BLEU certainly has significant advantages, there is no guarantee that an increase in BLEU score is an indicator of improved translation quality.
As BLEU scores are taken at the corpus level, it is difficult to give a textual example.
Nevertheless, the authors of this critique highlight two instances where BLEU seriously underperformed.
These were the 2005 NIST evaluations where a number of different machine translation systems were tested, and their study of the SYSTRAN engine versus two engines using statistical machine translation (SMT) techniques.
In the 2005 NIST evaluation, they report that the scores generated by BLEU failed to correspond to the scores produced in the human evaluations.
The system which was ranked highest by the human judges was only ranked 6th by BLEU.
In their study, they compared SMT systems with SYSTRAN, a knowledge based system.
The scores from BLEU for SYSTRAN were substantially worse than the scores given to SYSTRAN by the human judges.
They note that the SMT systems were trained using BLEU minimum error rate training, and point out that this could be one of the reasons behind the difference.
They conclude by recommending that BLEU be used in a more restricted manner, for comparing the results from two similar systems, and for tracking "broad, incremental changes to a single system".
Business intelligence
Business intelligence (BI) refers to technologies, applications and practices for the collection, integration, analysis, and presentation of business information and sometimes to the information itself.
The purpose of business intelligence--a term that dates at least to 1958--is to support better business decision making.
Thus, BI is also described as a decision support system (DSS):
BI is sometimes used interchangeably with briefing books, report and query tools and executive information systems.
In general, business intelligence systems are data-driven DSS.
BI systems provide historical, current, and predictive views of business operations, most often using data that has been gathered into a data warehouse or a data mart and occasionally working from operational data.
Software elements support the use of this information by assisting in the extraction, analysis, and reporting of information.
Applications tackle sales, production, financial, and many other sources of business data for purposes that include, notably, business performance management.
Information may be gathered on comparable companies to produce benchmarks.
History
Prior to the start of the Information Age in the late 20th century, businesses had to collect data from non-automated sources.
Businesses then lacked the computing resources necessary to properly analyze the data, and as a result, companies often made business decisions primarily on the basis of intuition.
As businesses automated their systems, the amount of data increased, but collecting it remained difficult because information could not easily be moved between or within systems.
Analysis of information informed long-term decision making, but it was slow, and short-term decisions often still required instinct or expertise.
Business intelligence was defined in 1958 by Hans Peter Luhn, who wrote,
In this paper, business is a collection of activities carried on for whatever purpose, be it science, technology, commerce, industry, law, government, defense, et cetera.
The communication facility serving the conduct of a business (in the broad sense) may be referred to as an intelligence system.
The notion of intelligence is also defined here, in a more general sense, as "the ability to apprehend the interrelationships of presented facts in such a way as to guide action towards a desired goal."
In 1989 Howard Dresner, later a Gartner Group analyst, popularized BI as an umbrella term to describe "concepts and methods to improve business decision making by using fact-based support systems."
In modern businesses the use of standards, automation and specialized software, including analytical tools, allows large volumes of data to be extracted, transformed, loaded and warehoused to greatly increase the speed at which information becomes available for decision-making.
Key intelligence topics
Business intelligence often uses key performance indicators (KPIs) to assess the present state of business and to prescribe a course of action.
Examples of KPIs include lead conversion rate (in sales) and inventory turnover (in inventory management).
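As a rough illustration, the two KPIs named above can be computed from a handful of figures. The sketch below assumes the commonly used definitions (converted leads divided by total leads, and cost of goods sold divided by average inventory); the function names and the sample numbers are hypothetical and not taken from any particular BI product.

```python
def lead_conversion_rate(converted_leads: int, total_leads: int) -> float:
    """Share of leads that became customers, as a fraction between 0 and 1."""
    if total_leads <= 0:
        raise ValueError("total_leads must be positive")
    return converted_leads / total_leads


def inventory_turnover(cost_of_goods_sold: float, average_inventory: float) -> float:
    """Number of times the average inventory was sold during the period."""
    if average_inventory <= 0:
        raise ValueError("average_inventory must be positive")
    return cost_of_goods_sold / average_inventory


# Hypothetical figures, for illustration only.
print(lead_conversion_rate(25, 200))        # 0.125
print(inventory_turnover(120_000, 30_000))  # 4.0
```

A real BI system would typically read these inputs from a data warehouse rather than from literals.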
Prior to the widespread adoption of computer and web applications, when information had to be manually input and calculated, performance data was often not available for weeks or months.
Recently, banks have tried to make data available at shorter intervals and have reduced delays.
The KPI methodology was further expanded with the Chief Performance Officer methodology which incorporated KPIs and root cause analysis into a single methodology.
Businesses that face higher operational/credit risk loading, such as credit card companies and "wealth management" services, often make KPI-related data available weekly.
In some cases, companies may even offer a daily analysis of data.
This fast pace requires analysts to use IT systems to process this large volume of data.
Chatterbot
A chatterbot (or chatbot) is a type of conversational agent, a computer program designed to simulate an intelligent conversation with one or more human users via auditory or textual methods.
In other words, a chatterbot is a computer program with artificial intelligence to talk to people through voices or typed words.
Though many appear to be intelligently interpreting the human input prior to providing a response, most chatterbots simply scan for keywords within the input and pull a reply with the most matching keywords or the most similar wording pattern from a local database.
Chatterbots may also be referred to as talk bots, chat bots, or chatterboxes.
Method of operation
A good understanding of a conversation is required to carry on a meaningful dialog but most chatterbots do not attempt this.
Instead they "converse" by recognizing cue words or phrases from the human user, which allows them to use pre-prepared or pre-calculated responses which can move the conversation on in an apparently meaningful way without requiring them to know what they are talking about.
For example, if a human types, "I am feeling very worried lately," the chatterbot may be programmed to recognize the phrase "I am" and respond by replacing it with "Why are you" plus a question mark at the end, giving the answer, "Why are you feeling very worried lately?"
A similar approach using keywords would be for the program to answer any comment including (Name of celebrity) with "I think they're great, don't you?"
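The two tricks just described can be sketched in a few lines of Python. This is only a minimal illustration, not the code of any actual chatterbot: the `respond` function, the fallback reply and the `CELEBRITIES` list are hypothetical names chosen for the example.

```python
import re

# Hypothetical keyword list standing in for "(Name of celebrity)".
CELEBRITIES = ("Ada Lovelace", "Alan Turing")


def respond(user_input: str) -> str:
    """Return a canned reply based on cue phrases, with no real understanding."""
    text = user_input.strip()

    # Cue phrase: turn "I am X" into "Why are you X?".
    match = re.match(r"i am\b(.*)", text, flags=re.IGNORECASE)
    if match:
        rest = match.group(1).rstrip(" .,!?")
        return f"Why are you{rest}?"

    # Keyword: any mention of a listed name triggers the same stock reply.
    lowered = text.lower()
    if any(name.lower() in lowered for name in CELEBRITIES):
        return "I think they're great, don't you?"

    # Fallback when no cue is recognized (an assumption of this sketch).
    return "Tell me more."


print(respond("I am feeling very worried lately,"))
# Why are you feeling very worried lately?
```

The program never models what "worried" means; it only rearranges the user's own words, which is why such replies can seem apt without any understanding behind them.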
Humans, especially those unfamiliar with chatterbots, sometimes find the resulting conversations engaging.
Critics of chatterbots call this engagement the ELIZA effect.
Some programs classified as chatterbots use other principles.
One example is Jabberwacky, which attempts to model the way humans learn new facts and language.
ELLA attempts to use natural language processing to make more useful responses from a human's input.
Some programs that use natural language conversation, such as SHRDLU, are not generally classified as chatterbots because they link their speech ability to knowledge of a simulated world.
This type of link requires a more complex artificial intelligence (e.g., a "vision" system) than standard chatterbots have.
Early chatterbots
The classic early chatterbots are ELIZA and PARRY.
More recent programs are Racter, Verbots, A.L.I.C.E., and ELLA.
The growth of chatterbots as a research field has created an expansion in their purposes.
While ELIZA and PARRY were used exclusively to simulate typed conversation, Racter was used to "write" a story called The Policeman's Beard is Half Constructed.
ELLA includes a collection of games and functional features to further extend the potential of chatterbots.
The term "ChatterBot" was coined by Michael Mauldin (creator of the first Verbot, Julia) in 1994 to describe these conversational programs.
Malicious chatterbots
Malicious chatterbots are frequently used to fill chat rooms with spam and advertising, or to entice people into revealing personal information, such as bank account numbers.
They are commonly found on Yahoo! Messenger, Windows Live Messenger, AOL Instant Messenger and other instant messaging protocols.
There has been a published report of a chatterbot used in a fake personal ad on a dating service's website.
Chatterbots in modern AI
Most modern AI research focuses on practical engineering tasks.
This is known as weak AI and is distinguished from strong AI, which would require sapience and reasoning abilities.
One pertinent field of AI research is natural language.
Usually weak AI fields employ specialised software or programming languages created for them.
For example, one of the 'most-human' natural language chatterbots, A.L.I.C.E., uses a programming language called AIML that is specific to its program, and its various clones, named Alicebots.
Nevertheless, A.L.I.C.E. is still based on pattern matching without any reasoning.
This is the same technique ELIZA, the first chatterbot, was using back in 1966.
Australian company MyCyberTwin also deals in strong AI, allowing users to create and sustain their own virtual personalities online.
MyCyberTwin.com also works in a corporate setting, allowing companies to set up Virtual AI Assistants.
Another notable program, known as Jabberwacky, also deals in strong AI, as it is claimed to learn new responses based on user interactions, rather than being driven from a static database like many other existing chatterbots.
Although such programs show initial promise, many of the existing results in trying to tackle the problem of natural language still appear fairly poor, and it seems reasonable to state that there is currently no general purpose conversational artificial intelligence.
This has led some software developers to focus more on the practical aspect of chatterbot technology - information retrieval.
A common rebuttal within the AI community to criticism of such approaches asks, "How do we know that humans don't also just follow some cleverly devised rules, in the way that chatterbots do?"
Two famous arguments against the rationale of the Turing test are John Searle's Chinese room argument and Ned Block's Blockhead argument.
Chatterbots/Virtual Assistants in Commercial Environments
Automated Conversational Systems have progressed and evolved far from the original designs of the first widely used chatbots.
In the UK, large commercial entities such as Lloyds TSB, Royal Bank of Scotland, Renault, Citroën and One Railway are already utilizing Virtual Assistants to reduce expenditures on Call Centres and provide a first point of contact that can inform the user exactly of points of interest, provide support, capture data from the user and promote products for sale.
In the UK, new projects and research are being conducted to introduce a Virtual Assistant into the classroom to assist the teacher.
This project is the first of its kind, and the chatbot VA in question is based on the Yhaken chatbot design (www.elzware.com).
The Yhaken template provides a further move forward in Automated Conversational Systems with features such as complex conversational routing and responses, well defined personality, a complex hierarchical construct with additional external reference points, emotional responses and in depth small talk, all to make the experience more interactive and involving for the user.
Annual contests for chatterbots
Many organizations try to encourage and support developers all over the world to develop chatterbots that are able to do a variety of tasks and to compete with each other through Turing tests and other challenges.
Annual contests include:
The Chatterbox Challenge
The Loebner Prize
Computational linguistics
Computational linguistics is an interdisciplinary field dealing with the statistical and/or rule-based modeling of natural language from a computational perspective.
This modeling is not limited to any particular field of linguistics.
Traditionally, computational linguistics was performed by computer scientists who had specialized in the application of computers to the processing of a natural language.
Recent research has shown that human language is much more complex than previously thought, so computational linguists often work as members of interdisciplinary teams, including linguists (specifically trained in linguistics), language experts (persons with some level of ability in the languages relevant to a given project), and computer scientists.
In general, computational linguistics draws upon the involvement of linguists, computer scientists, experts in artificial intelligence, cognitive psychologists, mathematicians, and logicians, amongst others.
Origins
Computational linguistics as a field predates artificial intelligence, a field under which it is often grouped.
Computational linguistics originated with efforts in the United States in the 1950s to use computers to automatically translate texts from foreign languages, particularly Russian scientific journals, into English.
Since computers had proven their ability to do arithmetic much faster and more accurately than humans, it was thought to be only a short matter of time before the technical details could be taken care of that would allow them the same remarkable capacity to process language.
When machine translation (also known as mechanical translation) failed to yield accurate translations right away, automated processing of human languages was recognized as far more complex than had originally been assumed.
Computational linguistics was born as the name of the new field of study devoted to developing algorithms and software for intelligently processing language data.
When artificial intelligence came into existence in the 1960s, the field of computational linguistics became that sub-division of artificial intelligence dealing with human-level comprehension and production of natural languages.
In order to translate one language into another, it was observed that one had to understand the grammar of both languages, including both morphology (the grammar of word forms) and syntax (the grammar of sentence structure).
In order to understand syntax, one had to also understand the semantics and the lexicon (or 'vocabulary'), and even to understand something of the pragmatics of language use.
Thus, what started as an effort to translate between languages evolved into an entire discipline devoted to understanding how to represent and process natural languages using computers.
Subfields
Computational linguistics can be divided into major areas depending upon the medium of the language being processed, whether spoken or textual; and upon the task being performed, whether analyzing language (recognition) or synthesizing language (generation).
Speech recognition and speech synthesis deal with how spoken language can be understood or created using computers.
Parsing and generation are sub-divisions of computational linguistics dealing respectively with taking language apart and putting it together.
Machine translation remains the sub-division of computational linguistics dealing with having computers translate between languages.
Some of the areas of research that are studied by computational linguistics include:
Computer aided corpus linguistics
Design of parsers or chunkers for natural languages
Design of taggers like POS-taggers (part-of-speech taggers)
Definition of specialized logics like resource logics for NLP
Research in the relation between formal and natural languages in general
Machine translation, e.g. by a translating computer
Computational complexity of natural language, largely modeled on automata theory, with the application of context-sensitive grammar and linearly-bounded Turing machines.
The Association for Computational Linguistics defines computational linguistics as:
...the scientific study of language from a computational perspective.
Computational linguists are interested in providing computational models of various kinds of linguistic phenomena.
Computer program
Computer programs (also software programs, or just programs) are instructions for a computer.
A computer requires programs to function, and a computer program does nothing unless its instructions are executed by a central processor.
Computer programs are usually executable programs or the source code from which executable programs are derived (e.g., compiled).
Computer source code is often written by professional computer programmers.
Source code is written in a programming language that usually follows one of two main paradigms: imperative or declarative programming.
Source code may be converted into an executable file (sometimes called an executable program) by a compiler.
Alternatively, computer programs may be executed by a central processing unit with the aid of an interpreter, or may be embedded directly into hardware.
Computer programs may be categorized along functional lines: system software and application software.
Many computer programs may run simultaneously on a single computer, a process known as multitasking.
Programming
Figure: Source code of a program written in the C programming language.
Computer programming is the iterative process of writing or editing source code.
Editing source code involves testing, analyzing, and refining.
A person who practices this skill is referred to as a computer programmer or software developer.
The sometimes lengthy process of computer programming is usually referred to as software development.
The term software engineering is becoming popular as the process is seen as an engineering discipline.
Paradigms
Computer programs can be categorized by the programming language paradigm used to produce them.
Two of the main paradigms are imperative and declarative.
Programs written using an imperative language specify an algorithm using declarations, expressions, and statements.
A declaration associates a variable name with a datatype.
For example:
var x: integer;
An expression yields a value.
For example:
2 + 2 yields 4.
Finally, a statement might assign an expression to a variable or use the value of a variable to alter the program's control flow.
For example:
x := 2 + 2; if x = 4 then do_something();
One criticism of imperative languages is the side-effect of an assignment statement on a class of variables called non-local variables.
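The criticism can be made concrete with a small Python sketch. The variable `total` and the function `add_to_total` are hypothetical names used only for illustration; the point is that an assignment inside the function changes state that lives outside it.

```python
total = 0  # non-local (module-level) variable


def add_to_total(amount: int) -> int:
    """Assign to the non-local variable `total` as a side effect."""
    global total
    total = total + amount
    return total


first = add_to_total(5)   # total is now 5
second = add_to_total(5)  # same argument, different result: total is now 10
print(first, second)      # 5 10
```

Because the result depends on hidden state, the same call can return different values, which makes such code harder to reason about than a mathematical function.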
Programs written using a declarative language specify the properties that have to be met by the output and do not specify any implementation details.
Two broad categories of declarative languages are functional languages and logical languages.
The principle behind functional languages (like Haskell) is to not allow side-effects, which makes it easier to reason about programs like mathematical functions.
The principle behind logical languages (like Prolog) is to define the problem to be solved — the goal — and leave the detailed solution to the Prolog system itself.
The goal is defined by providing a list of subgoals.
Then each subgoal is defined by further providing a list of its subgoals, etc.
If a path of subgoals fails to find a solution, then that subgoal is backtracked and another path is systematically attempted.
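The goal, subgoal and backtracking cycle described above can be approximated in Python. This is a deliberately simplified sketch, not how a Prolog system is actually implemented: it handles only goals without variables, and the rule base `RULES` and the function `prove` are hypothetical.

```python
from typing import Dict, List

# Hypothetical rule base: each goal maps to alternative lists of subgoals.
# An empty list of subgoals means the goal is a known fact.
RULES: Dict[str, List[List[str]]] = {
    "can_travel": [["has_car", "has_fuel"], ["has_ticket"]],
    "has_car": [[]],
    "has_ticket": [[]],
    # "has_fuel" has no rule, so it cannot be proven.
}


def prove(goal: str, rules: Dict[str, List[List[str]]], depth: int = 0) -> bool:
    """Return True if `goal` can be derived from `rules` by depth-first search."""
    if depth > 100:  # crude guard against circular rules
        return False
    for subgoals in rules.get(goal, []):
        # all() stops at the first failing subgoal; the loop then
        # backtracks and tries the next alternative for this goal.
        if all(prove(sub, rules, depth + 1) for sub in subgoals):
            return True
    return False


print(prove("can_travel", RULES))  # True, via the second alternative
print(prove("has_fuel", RULES))    # False
```

Here the first path for `can_travel` fails at `has_fuel`, so the search abandons it and succeeds through `has_ticket` instead.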
The form in which a program is created may be textual or visual.
In a visual language program, elements are graphically manipulated rather than textually specified.
Compilation or interpretation
A computer program in the form of a human-readable, computer programming language is called source code.
Source code may be converted into an executable image by a compiler or executed immediately with the aid of an interpreter.
Compiled computer programs are commonly referred to as executables, binary images, or simply as binaries — a reference to the binary file format used to store the executable code.
Compilers are used to translate source code from a programming language into either object code or machine code.
Object code needs further processing to become machine code, and machine code is the Central Processing Unit's native code, ready for execution.
Interpreted computer programs are either decoded and then immediately executed or are decoded into some efficient intermediate representation for future execution.
BASIC, Perl, and Python are examples of immediately executed computer programs.
Alternatively, Java computer programs are compiled ahead of time and stored as a machine independent code called bytecode.
Bytecode is then executed upon request by an interpreter called a virtual machine.
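A similar two-step arrangement can be observed in Python itself, which is used here in place of Java purely for illustration. In the sketch below, the built-in `compile` turns source text into a code object holding bytecode, and `exec` later asks the interpreter to run it; the variable names are hypothetical.

```python
import dis

source = "result = 2 + 2"

# Step 1: translate the source text into a code object (Python bytecode).
code_object = compile(source, "<example>", "exec")

# Step 2: later, have the interpreter execute the stored code object.
namespace = {}
exec(code_object, namespace)
print(namespace["result"])  # 4

# Optional: list the bytecode instructions; the exact output varies by version.
dis.dis(code_object)
```

The details differ from a Java virtual machine, but the separation between translating once and executing on request is the same idea.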
The main disadvantage of interpreters is that computer programs run slower than if compiled.
Interpreting code is slower than running the compiled version because the interpreter must decode each statement each time it is loaded and then perform the desired action.
On the other hand, software development may be quicker using an interpreter because testing is immediate when the compilation step is omitted.
Another disadvantage of interpreters is that the interpreter must be present on the computer at the time the computer program is executed.
By contrast, compiled computer programs need not have the compiler present at the time of execution.
No properties of a programming language require it to be exclusively compiled or exclusively interpreted.
The categorization usually reflects the most popular method of language execution.
For example, BASIC is thought of as an interpreted language and C a compiled language, despite the existence of BASIC compilers and C interpreters.
Self-modifying programs
A computer program in execution is normally treated as being different from the data the program operates on.
However, in some cases this distinction is blurred when a computer program modifies itself.
The modified computer program is subsequently executed as part of the same program.
Self-modifying code is possible for programs written in Lisp, COBOL, and Prolog.
Execution and storage
Typically, computer programs are stored in non-volatile memory until requested either directly or indirectly to be executed by the computer user.
Upon such a request, the program is loaded into random access memory, by a computer program called an operating system, where it can be accessed directly by the central processor.
The central processor then executes ("runs") the program, instruction by instruction, until termination.
A program in execution is called a process.
Termination is either by normal self-termination or by error — software or hardware error.
Embedded programs
Some computer programs are embedded into hardware.
A stored-program computer requires an initial computer program stored in its read-only memory to boot.
The boot process is to identify and initialize all aspects of the system, from CPU registers to device controllers to memory contents.
Following the initialization process, this initial computer program loads the operating system and sets the program counter to begin normal operations.
Independent of the host computer, a hardware device might have embedded firmware to control its operation.
Firmware is used when the computer program is rarely or never expected to change, or when the program must not be lost when the power is off.
Manual programming
Computer programs historically were manually input to the central processor via switches.
An instruction was represented by a configuration of on/off settings.
After setting the configuration, an execute button was pressed.
This process was then repeated.
Computer programs also historically were manually input via paper tape or punched cards.
After the medium was loaded, the starting address was set via switches and the execute button pressed.
Automatic program generation
Generative programming is a style of computer programming that creates source code through generic classes, prototypes, templates, aspects, and code generators to improve programmer productivity.
Source code is generated with programming tools such as a template processor or an Integrated Development Environment.
The simplest form of source code generator is a macro processor, such as the C preprocessor, which replaces patterns in source code according to relatively simple rules.
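Pattern replacement of this kind can be imitated with the `string.Template` class from Python's standard library. The sketch below is only an analogy to a macro or template processor, not a reimplementation of the C preprocessor; the template text and the names `generate_source` and `triple` are hypothetical.

```python
from string import Template

# Hypothetical template for a small function; $name and $factor are placeholders.
FUNCTION_TEMPLATE = Template(
    "def $name(x):\n"
    "    return x * $factor\n"
)


def generate_source(name: str, factor: int) -> str:
    """Fill in the template and return the generated source code as text."""
    return FUNCTION_TEMPLATE.substitute(name=name, factor=factor)


source = generate_source("triple", 3)
print(source)

# The generated text is ordinary source code and can be executed like any other.
namespace = {}
exec(source, namespace)
print(namespace["triple"](7))  # 21
```

The generator knows nothing about what the function does; it only substitutes text, which is what keeps this form of code generation simple.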
Software engines output source code or markup code that simultaneously become the input to another computer process.
The analogy is that of one process driving another process, with the computer code being burned as fuel.
Application servers are software engines that deliver applications to client computers.
For example, a Wiki is an application server that allows users to build dynamic content assembled from articles.
Wikis generate HTML, CSS, Java, and Javascript which are then interpreted by a web browser.
Simultaneous execution
Many operating systems support multitasking which enables many computer programs to appear to be running simultaneously on a single computer.
Operating systems may run multiple programs through process scheduling — a software mechanism to switch the CPU among processes frequently so that users can interact with each program while it is running.
Within hardware, modern day multiprocessor computers or computers with multicore processors may run multiple programs.
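Process scheduling on a single CPU can be sketched as a simple round-robin simulation. Round-robin is only one of many scheduling policies and is chosen here as an assumption for illustration; the function `round_robin`, the process names and the time units are hypothetical.

```python
from collections import deque
from typing import Dict, List


def round_robin(work: Dict[str, int], time_slice: int) -> List[str]:
    """Simulate round-robin scheduling and return the order of CPU turns.

    `work` maps a process name to the units of CPU time it still needs.
    """
    if time_slice <= 0:
        raise ValueError("time_slice must be positive")
    queue = deque(work.items())
    turns: List[str] = []
    while queue:
        name, remaining = queue.popleft()
        turns.append(name)  # this process gets the CPU now
        remaining -= time_slice
        if remaining > 0:
            queue.append((name, remaining))  # not finished: back of the queue
    return turns


# Hypothetical workload: three programs needing 3, 1 and 2 units of CPU time.
print(round_robin({"editor": 3, "player": 1, "browser": 2}, time_slice=1))
# ['editor', 'player', 'browser', 'editor', 'browser', 'editor']
```

Because each program receives the CPU again after a short wait, all three appear to make progress at once even though only one runs at any instant.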
Functional categories
Computer programs may be categorized along functional lines.
These functional categories are system software and application software.
System software includes the operating system which couples the computer's hardware with the application software.
The purpose of the operating system is to provide an environment in which application software executes in a convenient and efficient manner.
In addition to the operating system, system software includes utility programs that help manage and tune the computer.
If a computer program is not system software then it is application software.
Application software includes middleware, which couples the system software with the user interface.
Application software also includes utility programs that help users solve application problems, like the need for sorting.
Computer science
Computer science (or computing science) is the study and the science of the theoretical foundations of information and computation and their implementation and application in computer systems.
Computer science has many sub-fields; some emphasize the computation of specific results (such as computer graphics), while others relate to properties of computational problems (such as computational complexity theory).
Still others focus on the challenges in implementing computations.
For example, programming language theory studies approaches to describing computations, while computer programming applies specific programming languages to solve specific computational problems.
A further subfield, human-computer interaction, focuses on the challenges in making computers and computations useful, usable and universally accessible to people.
History
The early foundations of what would become computer science predate the invention of the modern digital computer.
Machines for calculating fixed numerical tasks, such as the abacus, have existed since antiquity.
Wilhelm Schickard built the first mechanical calculator in 1623.
Charles Babbage designed a difference engine in Victorian times (between 1837 and 1901) helped by Ada Lovelace.
Around 1900, punch-card machines were sold by companies that later became part of IBM.
However, all of these machines were constrained to perform a single task, or at best some subset of all possible tasks.
During the 1940s, as newer and more powerful computing machines were developed, the term computer came to refer to the machines rather than their human predecessors.
As it became clear that computers could be used for more than just mathematical calculations, the field of computer science broadened to study computation in general.
Computer science began to be established as a distinct academic discipline in the 1960s, with the creation of the first computer science departments and degree programs.
Since practical computers became available, many applications of computing have become distinct areas of study in their own right.
Many initially believed it impossible that "computers themselves could actually be a scientific field of study" (Levy 1984, p. 11), though it was in the "late fifties" (Levy 1984, p.11) that it gradually became accepted among the greater academic population.
The now well-known IBM brand formed part of the computer science revolution during this time.
IBM (short for International Business Machines) released the IBM 704 and later the IBM 709 computers, which were widely used during the exploration period of such devices.
"Still, working with the IBM [computer] was frustrating...if you had misplaced as much as one letter in one instruction, the program would crash, and you would have to start the whole process over again" (Levy 1984, p.13).
During the late 1950s, the computer science discipline was very much in its developmental stages, and such issues were commonplace.
Since then, the usability and effectiveness of computing technology have improved significantly.
Modern society has seen a significant shift from computers being used solely by experts or professionals to a more widespread user base.
By the 1990s, computers became accepted as being the norm within everyday life.
During this time, data entry was a primary use of computers, with many businesses preferring to streamline their practices through the use of a computer.
This also gave the additional benefit of removing the need for large amounts of documentation and file records, which consumed much-needed physical space within offices.
Major achievements
Despite its relatively short history as a formal academic discipline, computer science has made a number of fundamental contributions to science and society.
These include:
Applications within computer science
A formal definition of computation and computability, and proof that there are computationally unsolvable and intractable problems.
The concept of a programming language, a tool for the precise expression of methodological information at various levels of abstraction.
Applications outside of computing
Sparked the Digital Revolution which led to the current Information Age and the Internet.
In cryptography, breaking the Enigma machine was an important factor contributing to the Allied victory in World War II.
Scientific computing enabled advanced study of the mind, and mapping the human genome became possible with the Human Genome Project.
Distributed computing projects like Folding@home explore protein folding.
Algorithmic trading has increased the efficiency and liquidity of financial markets by using artificial intelligence, machine learning and other statistical and numerical techniques on a large scale.
Relationship with other fields
Despite its name, a significant amount of computer science does not involve the study of computers themselves.
Because of this, several alternative names have been proposed.
Danish scientist Peter Naur suggested the term datalogy, to reflect the fact that the scientific discipline revolves around data and data treatment, while not necessarily involving computers.
The first scientific institution to use the term was the Department of Datalogy at the University of Copenhagen, founded in 1969, with Peter Naur being the first professor in datalogy.
The term is used mainly in the Scandinavian countries.
Also, in the early days of computing, a number of terms for the practitioners of the field of computing were suggested in the Communications of the ACM: turingineer, turologist, flow-charts-man, applied meta-mathematician, and applied epistemologist.
Three months later in the same journal, comptologist was suggested, followed the next year by hypologist.
Recently the term computics has been suggested.
The term Informatik was used more frequently in Europe.
The renowned computer scientist Edsger Dijkstra stated, "Computer science is no more about computers than astronomy is about telescopes."
The design and deployment of computers and computer systems is generally considered the province of disciplines other than computer science.
For example, the study of computer hardware is usually considered part of computer engineering, while the study of commercial computer systems and their deployment is often called information technology or information systems.
Computer science is sometimes criticized as being insufficiently scientific, a view espoused in the statement "Science is to computer science as hydrodynamics is to plumbing", credited to Stan Kelly-Bootle and others.
However, there has been much cross-fertilization of ideas between the various computer-related disciplines.
Computer science research has also often crossed into other disciplines, such as cognitive science, economics, mathematics, physics (see quantum computing), and linguistics.
Computer science is considered by some to have a much closer relationship with mathematics than many scientific disciplines.
Early computer science was strongly influenced by the work of mathematicians such as Kurt Gödel and Alan Turing, and there continues to be a useful interchange of ideas between the two fields in areas such as mathematical logic, category theory, domain theory, and algebra.
The relationship between computer science and software engineering is a contentious issue, which is further muddied by disputes over what the term "software engineering" means, and how computer science is defined.
David Parnas, taking a cue from the relationship between other engineering and science disciplines, has claimed that the principal focus of computer science is studying the properties of computation in general, while the principal focus of software engineering is the design of specific computations to achieve practical goals, making the two separate but complementary disciplines.
The academic, political, and funding aspects of computer science tend to depend on whether a department in the U.S. was formed with a mathematical emphasis or an engineering emphasis.
In general, electrical engineering-based computer science departments have tended to succeed as computer science and/or engineering departments.
Computer science departments with a mathematics emphasis and a numerical orientation consider alignment with computational science.
Both types of departments tend to make efforts to bridge the field educationally if not across all research.
Fields of computer science
Computer science searches for concepts and formal proofs to explain and describe computational systems of interest.
As with all sciences, these theories can then be utilised to synthesize practical engineering applications, which in turn may suggest new systems to be studied and analysed.
While the ACM Computing Classification System can be used to split computer science up into different topics or fields, a more descriptive breakdown follows:
Mathematical foundations
Mathematical logic
Boolean logic and other ways of modeling logical queries; the uses and limitations of formal proof methods.
Number theory
Theory of proofs and heuristics for finding proofs in the simple domain of integers.
Used in cryptography as well as a test domain in artificial intelligence.
Graph theory
Foundations for data structures and searching algorithms.
Type theory
Formal analysis of the types of data, and the use of these types to understand properties of programs, especially program safety.
Category theory
Category theory provides a means of capturing all of math and computation in a single synthesis.
Computational geometry
The study of algorithms to solve problems stated in terms of geometry.
Numerical analysis
Foundations for algorithms that approximate solutions to problems in continuous mathematics, as well as the study of the limitations of floating-point computation, including round-off errors.
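The round-off errors mentioned above can be seen in a few lines. The following is a minimal sketch in Python, assuming floats are IEEE 754 double-precision values (true of most Python builds); the names values, running_total, careful_total and nearly_equal are illustrative only.

```python
import math

# 0.1 has no finite binary representation, so each stored value is
# already a slightly rounded approximation.
values = [0.1] * 10

# Add the values one at a time, as a simple loop would.
running_total = 0.0
for value in values:
    running_total += value

# math.fsum tracks intermediate partial sums to avoid losing precision.
careful_total = math.fsum(values)


def nearly_equal(a: float, b: float) -> bool:
    """Compare two floats with a relative tolerance instead of ==."""
    return math.isclose(a, b, rel_tol=1e-9)


print(running_total == 1.0)
print(careful_total == 1.0)
print(nearly_equal(running_total, 1.0))
```

Comparing with a tolerance, rather than testing for exact equality, is one common way of living with such errors.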
Theory of computation
Automata theory
Different logical structures for solving problems.
Computability theory
What is calculable with the current models of computers.
Proofs developed by Alan Turing and others provide insight into the possibilities of what can be computed and what cannot.
Computational complexity theory
Fundamental bounds (especially time and storage space) on classes of computations; in practice, study of which problems a computer can solve with reasonable resources (while computability theory studies which problems can be solved at all).
Quantum computing theory
Representation and manipulation of data using the quantum properties of particles and quantum mechanics.
Algorithms and data structures
Analysis of algorithms
Time and space complexity of algorithms.
Algorithms
Formal logical processes used for computation, and the efficiency of these processes.
Programming languages and compilers
Compilers
Ways of translating computer programs, usually from higher level languages to lower level ones.
Interpreters
Programs that take a computer program as input and execute it.
Programming languages
Formal language paradigms for expressing algorithms, and the properties of these languages (e.g., what problems they are suited to solve).
Concurrent, parallel, and distributed systems
Concurrency
The theory and practice of simultaneous computation; data safety in any multitasking or multithreaded environment.
Distributed computing
Computing using multiple computing devices over a network to accomplish a common objective or task, thereby reducing the time that a single processor would need to complete it.
Parallel computing
Computing using multiple concurrent threads of execution.
Software engineering
Algorithm design
Using ideas from algorithm theory to creatively design solutions to real tasks
Computer programming
The practice of using a programming language to implement algorithms
Formal methods
Mathematical approaches for describing and reasoning about software designs.
Reverse engineering
The application of the scientific method to the understanding of arbitrary existing software
Software development
The principles and practice of designing, developing, and testing programs, as well as proper engineering practices.
System architecture
Computer architecture
The design, organization, optimization and verification of a computer system, mostly about CPUs and memory subsystems (and the bus connecting them).
Computer organization
The implementation of computer architectures, in terms of descriptions of their specific electrical circuitry
Operating systems
Systems for managing computer programs and providing the basis of a useable system.
Communications
Computer audio
Algorithms and data structures for the creation, manipulation, storage, and transmission of digital audio recordings.
Also important in voice recognition applications.
Networking
Algorithms and protocols for communicating data across different shared or dedicated media, often including error correction.
Cryptography
Applies results from complexity, probability and number theory to invent and break codes.
Databases
Data mining
Data mining is the extraction of relevant data from all sources of data.
Relational databases
Study of algorithms for searching and processing information in documents and databases; closely related to information retrieval.
OLAP
Online Analytical Processing, or OLAP, is an approach to quickly provide answers to analytical queries that are multi-dimensional in nature.
OLAP is part of the broader category business intelligence, which also encompasses relational reporting and data mining.
Artificial intelligence
Artificial intelligence
The implementation and study of systems that exhibit an autonomous intelligence or behaviour of their own.
Artificial life
The study of digital organisms to learn about biological systems and evolution.
Automated reasoning
Solving engines, such as used in Prolog, which produce steps to a result given a query on a fact and rule database.
Computer vision
Algorithms for identifying three dimensional objects from one or more two dimensional pictures.
Machine learning
Automated creation of a set of rules and axioms based on input.
Natural language processing/Computational linguistics
Automated understanding and generation of human language
Robotics
Algorithms for controlling the behavior of robots.
Visual rendering (or Computer graphics)
Computer graphics
Algorithms both for generating visual images synthetically, and for integrating or altering visual and spatial information sampled from the real world.
Image processing
Determining information from an image through computation.
Human-Computer Interaction
Human computer interaction
The study of making computers and computations useful, usable and universally accessible to people, including the study and design of computer interfaces through which people use computers.
Scientific computing
Bioinformatics
The use of computer science to maintain, analyse, and store biological data, and to assist in solving biological problems such as protein folding, function prediction and phylogeny.
Cognitive Science
Computational modelling of real minds
Computational chemistry
Computational modelling of theoretical chemistry in order to determine chemical structures and properties
Computational neuroscience
Computational modelling of real brains
Computational physics
Numerical simulations of large non-analytic systems
Numerical algorithms
Algorithms for the numerical solution of mathematical problems such as root-finding, integration, the solution of ordinary differential equations and the approximation/evaluation of special functions.
Symbolic mathematics
Manipulation and solution of expressions in symbolic form, also known as Computer algebra.
Didactics of computer science/informatics
The subfield didactics of computer science focuses on cognitive approaches of developing competencies of computer science and specific strategies for analysis, design, implementation and evaluation of excellent lessons in computer science.
Computer science education
Some universities teach computer science as a theoretical study of computation and algorithmic reasoning.
These programs often feature the theory of computation, analysis of algorithms, formal methods, concurrency theory, databases, computer graphics and systems analysis, among others.
They typically also teach computer programming, but treat it as a vessel for the support of other fields of computer science rather than a central focus of high-level study.
Other colleges and universities, as well as secondary schools and vocational programs that teach computer science, emphasize the practice of advanced computer programming rather than the theory of algorithms and computation in their computer science curricula.
Such curricula tend to focus on those skills that are important to workers entering the software industry.
The practical aspects of computer programming are often referred to as software engineering.
However, there is a lot of disagreement over what the term "software engineering" actually means, and whether it is the same thing as programming.
Corpus linguistics
Corpus linguistics is the study of language as expressed in samples (corpora) or "real world" text.
This method represents a digestive approach to deriving a set of abstract rules by which a natural language is governed, or by which it relates to another language.
Originally compiled by hand, corpora are now largely derived by an automated process, the output of which is then corrected.
Computational methods had once been viewed as a holy grail of linguistic research, which would ultimately manifest a ruleset for natural language processing and machine translation at a high level.
Such has not been the case, and since the cognitive revolution, cognitive linguistics has been largely critical of many claimed practical uses for corpora.
However, as computation capacity and speed have increased, the use of corpora to study language and term relationships en masse has gained some respectability.
The corpus approach runs counter to Noam Chomsky's view that real language is riddled with performance-related errors, thus requiring careful analysis of small speech samples obtained in a highly controlled laboratory setting.
Corpus linguistics does away with Chomsky's competence/performance split; adherents believe that reliable language analysis best occurs on field-collected samples, in natural contexts and with minimal experimental interference.
History
A landmark in modern corpus linguistics was the publication by Henry Kucera and Nelson Francis of Computational Analysis of Present-Day American English in 1967, a work based on the analysis of the Brown Corpus, a carefully compiled selection of current American English, totalling about a million words drawn from a wide variety of sources.
Kucera and Francis subjected it to a variety of computational analyses, from which they compiled a rich and variegated opus, combining elements of linguistics, language teaching, psychology, statistics, and sociology.
A further key publication was Randolph Quirk's 'Towards a description of English Usage' (1960, Transactions of the Philological Society, 40-61) in which he introduced The Survey of English Usage.
Shortly thereafter, Boston publisher Houghton-Mifflin approached Kucera to supply a million word, three-line citation base for its new American Heritage Dictionary, the first dictionary to be compiled using corpus linguistics.
The AHD made the innovative step of combining prescriptive elements (how language should be used) with descriptive information (how it actually is used).
Other publishers followed suit.
The British publisher Collins' COBUILD monolingual learner's dictionary, designed for users learning English as a foreign language, was compiled using the Bank of English.
The Brown Corpus has also spawned a number of similarly structured corpora: the LOB Corpus (1960s British English), Kolhapur (Indian English), Wellington (New Zealand English), Australian Corpus of English (Australian English), the Frown Corpus (early 1990s American English), and the FLOB Corpus (1990s British English).
Other corpora represent many languages, varieties and modes, and include the International Corpus of English, and the British National Corpus, a 100 million word collection of a range of spoken and written texts, created in the 1990s by a consortium of publishers, universities (Oxford and Lancaster) and the British Library.
For contemporary American English, work has stalled on the American National Corpus, but the 360 million word Corpus of Contemporary American English (COCA) (1990-present) is now available.
Methods
Corpus linguistics means dealing with real input data, where descriptions based on a linguist's intuition are not usually helpful.
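One of the simplest computational analyses of a corpus is a word-frequency count. The sketch below is only an illustration: SAMPLE_CORPUS is a made-up three-sentence text standing in for a real corpus, and the tokenizer (lower-casing and keeping runs of letters and apostrophes) is a deliberately crude assumption, not the method used for the Brown Corpus or any other corpus named above.

```python
import re
from collections import Counter

# Hypothetical sample text standing in for a real corpus.
SAMPLE_CORPUS = (
    "The cat sat on the mat. "
    "The dog sat on the log. "
    "A cat and a dog met."
)


def word_frequencies(text: str) -> Counter:
    """Lower-case the text, split it into word tokens, and count them."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


frequencies = word_frequencies(SAMPLE_CORPUS)

# The most frequent word types, with their counts.
for word, count in frequencies.most_common(3):
    print(word, count)
```

On real data the same counting step would typically be preceded by more careful tokenization and followed by manual correction.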
Cross-platform
Cross-platform (also known as multi-platform) is a term used in computing to refer to computer programs, operating systems, computer languages, programming languages, or other computer software and their implementations which can be made to work on multiple computer platforms.
“Cross-platform” and “multi-platform” both refer to the idea that a given piece of computer software is able to be run on more than one computer platform.
There are two major types of cross-platform software. One requires building for each platform that it supports (e.g., software written in a compiled language such as Pascal). The other can be run directly on any platform which supports it (e.g., software written in an interpreted language such as Perl, Python, or shell script, or software written in a language which compiles to bytecode that is then redistributed, as is the case with Java and the languages used in the .NET Framework).
For example, a cross-platform application may run on Microsoft Windows on the x86 architecture, Linux on the x86 architecture and Mac OS X on either the PowerPC or x86 based Apple Macintosh systems.
A cross-platform application may run on as many as all existing platforms, or on as few as two platforms.
Platforms
A platform is a combination of hardware and software used to run software applications.
A platform can be described simply as an operating system or computer architecture, or it could be the combination of both.
Probably the most familiar platform is Microsoft Windows running on the x86 architecture.
Other well-known desktop computer platforms include Linux and Mac OS X (both of which are themselves cross-platform).
There are, however, many devices such as cellular telephones that are also effectively computer platforms but less commonly thought about in that way.
Application software can be written to depend on the features of a particular platform—either the hardware, operating system, or virtual machine it runs on.
The Java platform is a virtual machine platform which runs on many operating systems and hardware types, and is a common platform for software to be written for.
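The idea of a platform as a combination of operating system and architecture can be illustrated with a short sketch in Python, using the standard platform module. The function name describe_platform and the "OS/architecture" output format are assumptions made for this example.

```python
import platform


def describe_platform() -> str:
    """Return 'operating system/architecture' for the running machine."""
    operating_system = platform.system()  # e.g. 'Linux', 'Windows', 'Darwin'
    architecture = platform.machine()     # e.g. 'x86_64', 'AMD64', 'arm64'
    return f"{operating_system}/{architecture}"


print(describe_platform())
```

The same script reports a different pair on each machine it runs on, while the script itself stays unchanged.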
Hardware platforms
A hardware platform can refer to a computer’s architecture or processor architecture.
For example, the x86 and x86-64 CPUs make up one of the most common computer architectures in use in home machines today.
These machines commonly run Microsoft Windows, though they can run other operating systems as well, including Linux, OpenBSD, NetBSD, Mac OS X and FreeBSD.
Software platforms
A software platform can be either an operating system or a programming environment, though more commonly it is a combination of both.
A notable exception to this is Java, which uses an operating system independent virtual machine for its compiled code, known in the world of Java as bytecode.
Examples of software platforms include:
MS-DOS (x86), DR-DOS (x86), FreeDOS (x86) etc.
Microsoft Windows (x86, x64)
Linux (x86, x64, PowerPC, various other architectures)
Mac OS X (PowerPC, x86)
OS/2, eComStation
AmigaOS (m68k), AROS (x86, PowerPC, m68k), MorphOS (PowerPC)
Java
Java platform
As previously noted, the Java platform is an exception to the general rule that an operating system is a software platform.
The Java platform provides a virtual machine, or “virtual CPU”, which runs all of the code that is written for the language.
This enables the same executable binary to run on all systems which support the Java software, through the Java Virtual Machine.
Java executables do not run directly on the operating system; that is, neither Windows nor Linux executes Java programs directly.
Because of this, however, Java is limited in that it does not directly support system-specific functionality.
JNI can be used to access system specific functions, but then the code is likely no longer portable.
Java programs can run on at least the Microsoft Windows, Mac OS X, Linux, and Solaris operating systems, and so the language is limited to functionality that exists on all these systems.
This includes things such as computer networking and Internet sockets, but not necessarily raw hardware input/output.
Cross-platform software
In order for software to be considered cross-platform, it must be able to function on more than one computer architecture or operating system.
This can be a time-consuming task given that different operating systems have different application programming interfaces or APIs (for example, Linux uses a different API for application software than Windows does).
Just because a particular operating system may run on different computer architectures, that does not mean that the software written for that operating system will automatically work on all architectures that the operating system supports.
One example as of August 2006 was OpenOffice.org, which did not natively run on the AMD64 or EM64T lines of processors implementing the x86-64 64-bit standards for computers; this has since been changed, and the OpenOffice.org suite of software is “mostly” ported to these 64-bit systems (see wiki.services.openoffice.org/wiki/Porting_to_x86-64_(AMD64,_EM64T)).
This also means that just because a program is written in a popular programming language such as C or C++, it does not mean it will run on all operating systems that support that programming language.
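A small illustration of how platform conventions differ even for a routine task is the joining of file paths. Python's standard library ships both the Windows rules (ntpath) and the POSIX rules (posixpath) on every system, so the difference can be shown side by side; os.path is whichever of the two matches the running system. The file names below are made up for the example.

```python
import ntpath      # path rules used on Windows
import os
import posixpath   # path rules used on Linux, Mac OS X and other Unix-like systems

windows_style = ntpath.join("docs", "report.txt")
posix_style = posixpath.join("docs", "report.txt")

# Portable code asks the library for the native form instead of
# hard-coding either separator.
native_style = os.path.join("docs", "report.txt")

print(windows_style)
print(posix_style)
print(native_style)
```

A program that hard-codes one of the two forms is written in a portable language but is not itself portable.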
Web applications
Web applications are typically described as cross-platform because, ideally, they are accessible from any of various web browsers within different operating systems.
Such applications generally employ a client-server system architecture, and vary widely in complexity and functionality.
This wide variability significantly complicates the goal of cross-platform capability, which is routinely at odds with the goal of advanced functionality.
Basic applications
Basic web applications perform all or most processing from a stateless web server, and pass the result to the client web browser.
All user interaction with the application consists of simple exchanges of data requests and server responses.
These types of applications were the norm in the early phases of World Wide Web application development.
Such applications follow a simple transaction model, identical to that of serving static web pages.
Today, they are still relatively common, especially where cross-platform compatibility and simplicity are deemed more critical than advanced functionality.
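The simple transaction model of a basic web application can be sketched as a stateless handler whose response depends only on the request. The example below uses Python's WSGI convention and the standard wsgiref.util.setup_testing_defaults helper; the names application and simulate_request, and the response text, are assumptions made for illustration. No network connection is involved: the handler is called directly with a synthetic request.

```python
from wsgiref.util import setup_testing_defaults


def application(environ, start_response):
    """Stateless handler: nothing is remembered between requests."""
    path = environ.get("PATH_INFO", "/")
    body = f"You requested {path}\n".encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


def simulate_request(path: str) -> bytes:
    """Call the handler with a synthetic request and return the body."""
    environ = {"PATH_INFO": path}
    setup_testing_defaults(environ)  # fills in the remaining WSGI keys

    def start_response(status, headers):
        # A real server would send the status line and headers here.
        pass

    return b"".join(application(environ, start_response))


print(simulate_request("/inbox"))
```

Because the handler keeps no state, repeating the same request always produces the same response, which is what makes such applications easy to serve to any browser.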
Advanced applications
Prominent examples of advanced web applications include the Web interface to Gmail, A9.com, and the maps.live.com section of Live Search.
Such advanced applications routinely depend on additional features found only in the more recent versions of popular web browsers.
These dependencies include Ajax, JavaScript, “Dynamic” HTML, SVG, and other components of rich internet applications.
Older versions of popular browsers tend to lack support for certain features.
Design strategies
Because of the competing interests of cross-platform compatibility and advanced functionality, numerous alternative web application design strategies have emerged.
Such strategies include:
Graceful degradation
Graceful degradation attempts to provide the same or similar functionality to all users and platforms, while diminishing that functionality to a ‘least common denominator’ for more limited client browsers.
For example, a user attempting to use a limited-feature browser to access Gmail may notice that Gmail switches to “Basic Mode,” with reduced functionality.
Some view this strategy as a lesser form of cross-platform capability.
Separation of functionality
Separation of functionality attempts to simply omit those subsets of functionality that are not available from within certain client browsers or operating systems, while still delivering a ‘complete’ application to the user (see also Separation of concerns).
Multiple codebase
Multiple codebase applications present different versions of an application depending on the specific client in use.
This strategy is arguably the most complicated and expensive way to fulfill cross-platform capability, since even different versions of the same client browser (within the same operating system) can differ dramatically from each other.
This is further complicated by the support for “plugins” which may or may not be present for any given installation of a particular browser version.
Third party libraries
Third party libraries attempt to simplify cross-platform capability by ‘hiding’ the complexities of client differentiation behind a single, unified API.
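Of the strategies listed above, graceful degradation is perhaps the easiest to sketch. In the Python example below, the feature names, the two modes and the HTML fragments are all hypothetical; a real application would detect client capabilities in its own way.

```python
# Hypothetical set of features needed for the full experience.
FULL_FEATURES = {"ajax", "javascript", "svg"}


def choose_mode(client_features: set) -> str:
    """Return 'full' when the client supports everything, else 'basic'."""
    if FULL_FEATURES <= client_features:
        return "full"
    return "basic"


def render_inbox(client_features: set) -> str:
    """Serve the richest page the client can handle."""
    if choose_mode(client_features) == "full":
        return "<div id='inbox' data-live-updates='on'></div>"
    # Fallback: plain markup that works without scripting.
    return "<table id='inbox'></table>"


print(choose_mode({"ajax", "javascript", "svg", "video"}))
print(choose_mode({"javascript"}))
```

Every client receives an inbox; only the richness of the page varies with what the client supports.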
Testing strategies
One complicated aspect of cross-platform web application design is the need for software testing.
In addition to the complications mentioned previously, there is the additional restriction that some browsers prohibit installation of different versions of the same browser on the same operating system.
Techniques such as full virtualization are sometimes used as a workaround for this problem.
Traditional applications
Although web applications are becoming increasingly popular, many computer users still use traditional application software which does not rely on a client/web-server architecture.
The distinction between “traditional” and “web” applications is not always unambiguous, however, because applications have many different features, installation methods and architectures; and some of these can overlap and occur in ways that blur the distinction.
Nevertheless, this simplifying distinction is a common and useful generalization.
Binary software
Traditionally in modern computing, application software has been distributed to end-users as binary images, which are stored in executables, a specific type of binary file.
Such executables only support the operating system and computer architecture that they were built for—which means that making a “cross-platform executable” would be something of a massive task, and is generally not done.
For software that is distributed as a binary executable, such as software written in C or C++, the programmer must build the software for each different operating system and computer architecture.
For example, Mozilla Firefox, an open-source web browser, is available on Microsoft Windows, Mac OS X (both PowerPC and x86 through something Apple calls a Universal binary), and Linux on multiple computer architectures.
The three platforms (in this case, Windows, Mac OS X, and Linux) are separate executable distributions, although they come from the same source code.
In the context of binary software, cross-platform programs are written as source code and then “translated” to each system they run on by compiling them on the different platforms.
Also, software can be ported to a new computer architecture or operating system so that the program becomes more cross-platform than it already is.
For example, a program such as Firefox, which already runs on Windows on the x86 family, can be modified and re-built to run on Linux on the x86 (and potentially other architectures) as well.
As an alternative to porting, cross-platform virtualization allows applications compiled for one CPU and operating system to run on a system with a different CPU and/or operating system, without modification to the source code or binaries.
As an example, Apple's Rosetta software, which is built into Intel-based Apple Macintosh computers, runs applications compiled for the previous generation of Macs that used PowerPC CPUs.
Another example is IBM PowerVM Lx86, which allows Linux/x86 applications to run unmodified on the Linux/Power operating system.
Scripts and interpreted languages
A script can be considered to be cross-platform if the scripting language is available on multiple platforms and the script only uses the facilities provided by the language.
That is, a script written in Python for a Unix-like system will likely run with little or no modification on Windows, because Python also runs on Windows; there is also more than one implementation of Python that will run the same scripts (e.g., IronPython for .NET).
The same goes for many of the open source programming languages that are available and are scripting languages.
Unlike binary executables, the same script can be used on all computers that have software to interpret the script.
This is because the script is generally stored in plain text in a text file.
There may be some issues, however, such as the type of new line character that sits between the lines.
Generally, however, little or no work has to be done to make a script written for one system run on another.
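The new line issue mentioned above usually comes down to Windows-style text files ending each line with a carriage return plus line feed, while Unix-like systems use a line feed alone. A minimal sketch of normalizing a script's line endings in Python follows; the function name normalize_newlines and the sample script text are assumptions made for the example.

```python
def normalize_newlines(text: str) -> str:
    """Convert CRLF and lone CR line endings to LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


# A two-line script as it might be saved by a Windows text editor.
windows_script = "echo one\r\necho two\r\n"
unix_script = normalize_newlines(windows_script)

print(repr(unix_script))
```

The CRLF pairs are replaced first so that each one becomes a single line feed rather than two.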
Some quite popular cross-platform scripting or interpreted languages are:
bash—A Unix shell commonly run on Linux and other modern Unix-like systems, as well as on Windows via the Cygwin POSIX compatibility layer.
Python—A modern scripting language where the focus is on rapid application development and ease-of-writing, instead of program run-time efficiency.
Perl—A scripting language first released in 1987.
Used for CGI WWW programming, small system administration tasks, and more.
PHP—A scripting language most popular in use on the WWW for web applications.
Ruby—A scripting language whose purpose is to be object-oriented and easy to read.
Can also be used on the web through Ruby on Rails.
Tcl—A dynamic programming language, suitable for a wide range of uses, including web and desktop applications, networking, administration, testing and many more.
Video games
Cross-platform is a term that can also apply to video games.
Such games are released on a range of video game consoles and handheld game consoles, which are specialized computers dedicated to the task of playing games (and thus are platforms like any other computer).
Examples of these games include:
Miner 2049er, the first major multiplatform game
Phantasy Star Online
Lara Croft Tomb Raider: Legend
FIFA Series
Shadow of Legend
… which are spread across a variety of platforms, such as the Nintendo GameCube, PlayStation 2, Xbox, PC, and mobile devices.
In some cases, depending on the hardware of a particular system it may take longer than expected to create a video game across multiple platforms.
So, a video game may only get released on a few platforms and then later released on the remaining platforms.
Typically, this is what occurs when a new system is released, because the developers of the video game need to become acquainted with the hardware and software associated with the new console.
Some games may not become cross-platform because of licensing agreements between the developers and the maker of the video game console which state that the game will only be made for one particular console.
As an example, Disney could create a new game and wish to release it on the latest Nintendo and Sony game consoles.
If Disney licenses the game with Sony first, Disney may be required to only release the game on Sony’s console for a short time, or indefinitely—effectively prohibiting the game from being cross-platform at least for a period of time.
Several developers have developed ways to play games online while using different platforms.
Epic Games, Microsoft and Valve Software all have technology that allows Xbox 360 and PS3 gamers to play with PC gamers, letting gamers decide for themselves which platform is best for a game.
The first game released to allow this interactivity between PC and console games was Quake 3.
Games that feature cross-platform online play include:
Champions Online
Lost Planet: Colonies
Phantasy Star Online
Shadowrun
UNO
Final Fantasy XI Online
Platform independent software
Software that is platform independent does not rely on any special features of any single platform, or, if it does, handles those special features such that it can deal with multiple platforms.
All algorithms, such as the quicksort algorithm, can be implemented on different platforms.
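As an illustration, the following is a minimal sketch of quicksort in Python that uses only features of the language itself, so nothing in it depends on a particular operating system or processor. It is a simple variant that builds new lists rather than sorting in place, chosen for readability; the choice of the middle element as pivot is likewise just one common option.

```python
def quicksort(items: list) -> list:
    """Return a new sorted list using a simple quicksort variant."""
    if len(items) <= 1:
        return list(items)
    pivot = items[len(items) // 2]
    smaller = [x for x in items if x < pivot]
    equal = [x for x in items if x == pivot]
    larger = [x for x in items if x > pivot]
    # Sort the two outer partitions recursively; 'equal' is already done.
    return quicksort(smaller) + equal + quicksort(larger)


print(quicksort([3, 1, 4, 1, 5, 9, 2, 6]))
```

The same source file gives the same result wherever a Python interpreter is available.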
Cross-platform programming
Cross-platform programming is the practice of actively writing software that will work on more than one platform.
Approaches to cross-platform programming
There are different ways of approaching the problem of writing a cross-platform application program.
One such approach is simply to create multiple versions of the same program in different source trees—in other words, the Windows version of a program might have one set of source code files and the Macintosh version might have another, while a FOSS *nix system might have another.
While this is a straightforward approach to the problem, it has the potential to be considerably more expensive in development cost, development time, or both, especially for the corporate entities.
The idea behind this is to create two or more different programs that behave similarly to each other.
It is also possible that this means of developing a cross-platform application will result in more problems with bug tracking and fixing, because the two different source trees would have different programmers, and thus different defects in each version.
The smaller the programming team, the quicker the bug fixes tend to be.
Another approach that is used is to depend on pre-existing software that hides the differences between the platforms—called abstraction of the platform—such that the program itself is unaware of the platform it is running on.
It could be said that such programs are platform agnostic.
Programs that run on the Java Virtual Machine (JVM) are built in this fashion.
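Abstraction of the platform can be sketched on a small scale with a single function that hides where each operating system keeps per-user configuration files. The Python example below is an assumption-laden illustration: the function name config_dir is hypothetical, the directory conventions shown (the APPDATA variable on Windows, Library/Application Support on Mac OS X, and the XDG convention elsewhere) are common ones rather than the only ones, and the system and env parameters exist only so that the function can be exercised for platforms other than the one it is running on.

```python
import os
import sys


def config_dir(app_name, system=None, env=None):
    """Return the per-user configuration directory for app_name.

    Raises KeyError if the expected environment variable is not set.
    """
    system = sys.platform if system is None else system
    env = os.environ if env is None else env
    if system.startswith("win"):
        return env["APPDATA"] + "\\" + app_name
    if system == "darwin":
        return env["HOME"] + "/Library/Application Support/" + app_name
    # Other Unix-like systems: follow the XDG convention.
    base = env.get("XDG_CONFIG_HOME") or env["HOME"] + "/.config"
    return base + "/" + app_name


# Example with an explicit, made-up environment for a Linux system.
print(config_dir("demo", system="linux", env={"HOME": "/home/me"}))
```

Code that calls config_dir("demo") never mentions an operating system; all platform knowledge sits behind the one function, which is the essence of the approach.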
Some applications mix various methods of cross-platform programming to create the final application.
An example of this is the Firefox web browser, which uses abstraction to build some of the lower-level components, separate source subtrees for implementing platform specific features (like the GUI), and the implementation of more than one scripting language to help facilitate ease of portability.
Firefox implements XUL, CSS and JavaScript for extending the browser, in addition to classic Netscape-style browser plugins.
Much of the browser itself is written in XUL, CSS, and JavaScript, as well.
Cross-platform programming toolkits
There are a number of tools which are available to help facilitate the process of cross-platform programming:
Simple DirectMedia Layer—An open source cross-platform multimedia library written in C that creates an abstraction over various platforms’ graphics, sound, and input APIs.
It runs on many operating systems including Linux, Windows and Mac OS X and is aimed at games and multimedia applications.
Cairo—A free software library used to provide a vector graphics-based, device-independent API.
It is designed to provide primitives for 2-dimensional drawing across a number of different backends.
Cairo is written in C and has bindings for many programming languages.
ParaGUI—ParaGUI is a cross-platform high-level application framework and GUI library.
It can be compiled on various platforms (Linux, Win32, BeOS, Mac OS, ...).
ParaGUI is based on the Simple DirectMedia Layer (SDL).
ParaGUI is targeted at cross-platform multimedia applications and embedded devices operating on framebuffer displays.
wxWidgets—An open source widget toolkit that is also an application framework.
It runs on Unix-like systems with X11, Microsoft Windows and Mac OS X. It permits applications written to use it to run on all of the systems that it supports, if the application does not use any operating system-specific programming in addition to it.
Qt—An application framework and widget toolkit for Unix-like systems with X11, Microsoft Windows, Mac OS X, and other systems—available under both open source and commercial licenses.
GTK+—An open source widget toolkit for Unix-like systems with X11 and Microsoft Windows.
FLTK—Another open source cross-platform toolkit, but more lightweight because it restricts itself to the GUI.
Mozilla—An open source platform for building Mac, Windows and Linux applications.
Mono (and more specifically, Microsoft .NET)—A cross-platform framework for applications and programming languages.
molib—A robust commercial application toolkit library that abstracts the system calls through C++ objects (such as the file system, database system and thread implementation).
This allows for the creation of applications that compile and run under Microsoft Windows, Mac OS X, GNU/Linux, and other uses (Sun OS, AIX, HP-UX, 32/64 bit, SMP).
Use in concert with the sandbox to create GUI-based applications.
fpGUI - An open source widget toolkit that is completely implemented in Object Pascal.
It currently supports Linux and Windows, with partial support for Windows CE.
fpGUI does not rely on any large libraries; instead it talks directly to Xlib (Linux) or GDI (Windows).
The framework is compiled with the Free Pascal compiler.
Mac OS support is also in the works.
Tcl/Tk - Tcl (Tool Command Language) is a dynamic programming language, suitable for a wide range of uses, including web and desktop applications, networking, administration, testing and many more.
Tcl is open source and business-friendly; it is a mature yet evolving language that is cross-platform, easily deployed and highly extensible.
Tk is a graphical user interface toolkit that takes developing desktop applications to a higher level than conventional approaches.
Tk is the standard GUI not only for Tcl, but for many other dynamic languages, and can produce rich, native applications that run unchanged across Windows, Mac OS X, Linux and more.
The combination of Tcl and the Tk GUI toolkit is referred to as Tcl/Tk.
XVT is a cross-platform toolkit for creating enterprise and desktop applications in C/C++ on Windows, Linux, Unix (Solaris, HPUX, AIX), and Mac.
The most recent release is 5.8, from April 2007.
Cross-platform development environments
Cross-platform applications can also be built using proprietary IDEs, or so-called Rapid Application Development tools.
There are a number of development environments which allow developers to build and deploy applications across multiple platforms:
Eclipse—An open source software framework and IDE, extensible through plug-ins including the C++ Development Toolkit.
Eclipse is available on any operating system with a modern Java virtual machine (including Windows, Linux, Mac OS X, Sun, HP-UX, and other systems).
IntelliJ IDEA—A proprietary IDE.
NetBeans—An open source software framework and IDE, extensible through plug-ins.
NetBeans is available on any operating system with a modern Java virtual machine (including Windows, Linux, Mac OS X, Sun, HP-UX, and other systems).
It is similar to Eclipse in features and functionality and is promoted by Sun Microsystems.
Omnis Studio—A proprietary IDE or Rapid Application Development tool for creating enterprise and web applications for Windows, Linux, and Mac OS X.
Runtime Revolution—A proprietary IDE, compiler engine and CGI builder that cross-compiles to Windows, Mac OS X (PPC, Intel), Linux, Solaris, BSD, and Irix.
Code::Blocks—A free/open source, cross-platform IDE.
It is developed in C++ using wxWidgets.
Using a plugin architecture, its capabilities and features are defined by the provided plugins.
Lazarus—A cross-platform visual IDE developed for and supported by the open source Free Pascal compiler.
It aims to provide a Delphi clone for Rapid Application Development to Pascal and Object Pascal developers.
REALbasic—An object-oriented dialect of the BASIC programming language, developed and commercially marketed by REAL Software, Inc. in Austin, Texas, for Mac OS X, Microsoft Windows, and Linux.
Criticisms of cross-platform development
Cross-platform development raises a number of issues, including the following:
Testing cross-platform applications may be considerably more complicated, since different platforms can exhibit slightly different behaviors or subtle bugs.
This problem has led some developers to deride cross-platform development as “Write Once, Debug Everywhere”, a take on Sun’s “Write Once, Run Anywhere” marketing slogan.
Developers are often restricted to using the lowest common denominator subset of features which are available on all platforms.
This may hinder the application's performance or prohibit developers from using platforms’ most advanced features.
Different platforms often have different user interface conventions, which cross-platform applications do not always accommodate.
For example, applications developed for Mac OS X and GNOME are supposed to place the most important button on the right-hand side of windows and dialogs, whereas Microsoft Windows and KDE have the opposite convention.
Though many of these differences are subtle, a cross-platform application which does not conform appropriately to these conventions may feel clunky or alien to the user.
When working quickly, such opposing conventions may even result in data loss, such as in a dialog box confirming whether the user wants to save or discard changes to a file.
Programs written in scripting languages, or compiled to bytecode for a virtual machine, must be translated into native executable code each time the application is executed, imposing a performance penalty.
This performance hit can be alleviated using advanced techniques like just-in-time compilation; but even using such techniques, some performance overhead may be unavoidable.
Data
Data (singular: datum) are collected descriptors of natural phenomena, including the results of experience, observation or experiment, or a set of premises.
This may consist of numbers, words, or images, particularly as measurements or observations of a set of variables.
Etymology
The word data is the plural of Latin datum, neuter past participle of dare, "to give", hence "something given".
The past participle of "to give" has been used for millennia, in the sense of a statement accepted at face value; one of the works of Euclid, circa 300 BC, was the Dedomena (in Latin, Data).
In discussions of problems in geometry, mathematics, engineering, and so on, the terms givens and data are used interchangeably.
Such usage is the origin of data as a concept in computer science: data are numbers, words, images, etc., accepted as they stand.
The word is pronounced dey-tuh, dat-uh, or dah-tuh.
Experimental data are data generated within the context of a scientific investigation.
Mathematically, data can be grouped in many ways.
Usage in English
In English, the word datum is still used in the general sense of "something given", and more specifically in cartography, geography, geology, NMR and drafting to mean a reference point, reference line, or reference surface.
More generally speaking, any measurement or result can be called a (single) datum, but data point is more common.
Both datums (see usage in datum article) and the originally Latin plural data are used as the plural of datum in English, but data is more commonly treated as a mass noun and used in the singular, especially in day-to-day usage.
For example, "This is all the data from the experiment".
This usage is inconsistent with the rules of Latin grammar and traditional English, which would instead suggest "These are all the data from the experiment".
Some British and UN academic, scientific, and professional style guides (e.g., see page 43 of the World Health Organization Style Guide) request that authors treat data as a plural noun.
Other international organizations, such as the IEEE Computer Society, allow its usage as either a mass noun or a plural, based on author preference.
It is now usually treated as a singular mass noun in informal usage, but usage in scientific publications shows a strong UK/U.S. divide.
U.S. usage tends to treat data in the singular, including in serious and academic publishing, although some major newspapers (such as the New York Times) regularly use it in the plural.
"The plural usage is still common, as this headline from the New York Times attests: “Data Are Elusive on the Homeless.”
Sometimes scientists think of data as plural, as in These data do not support the conclusions.
But more often scientists and researchers think of data as a singular mass entity like information, and most people now follow this in general usage." (www.bartleby.com/61/51/D0035100.html)
UK usage now widely accepts treating data as singular in standard English, including everyday newspaper usage, at least in non-scientific use.
UK scientific publishing usually still prefers treating it as a plural.
Some UK university style guides recommend using data for both singular and plural use and some recommend treating it only as a singular in connection with computers.
Uses of data in science and computing
Raw data are numbers, characters, images or other outputs from devices that convert physical quantities into symbols, in a very broad sense.
Such data are typically further processed by a human or input into a computer, stored and processed there, or transmitted (output) to another human or computer.
Raw data is a relative term; data processing commonly occurs by stages, and the "processed data" from one stage may be considered the "raw data" of the next.
Mechanical computing devices are classified according to the means by which they represent data.
An analog computer represents a datum as a voltage, distance, position, or other physical quantity.
A digital computer represents a datum as a sequence of symbols drawn from a fixed alphabet.
The most common digital computers use a binary alphabet, that is, an alphabet of two characters, typically denoted "0" and "1".
More familiar representations, such as numbers or letters, are then constructed from the binary alphabet.
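As a minimal sketch of how familiar representations can be constructed from a binary alphabet, the following Python fragment encodes a non-negative integer and a letter as strings of "0" and "1". The 8-symbol width, the helper names `to_bits` and `from_bits`, and the use of the letter's character code are assumptions chosen for illustration; real machines use many different encodings.

```python
def to_bits(n: int, width: int = 8) -> str:
    """Encode a non-negative integer as a string over the alphabet {"0", "1"}."""
    if n < 0:
        raise ValueError("this sketch only handles non-negative integers")
    # format(n, "b") gives the binary digits; zfill pads on the left with "0".
    return format(n, "b").zfill(width)


def from_bits(bits: str) -> int:
    """Decode a string of "0" and "1" back into an integer."""
    return int(bits, 2)


number_bits = to_bits(5)          # the number five
letter_bits = to_bits(ord("A"))   # the letter A, via its character code 65
decoded_letter = chr(from_bits(letter_bits))
```

Here the same two-character alphabet carries both a number and a letter; only the interpretation applied when decoding differs.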
Some special forms of data are distinguished.
A computer program is a collection of data, which can be interpreted as instructions.
Most computer languages make a distinction between programs and the other data on which programs operate, but in some languages, notably Lisp and similar languages, programs are essentially indistinguishable from other data.
It is also useful to distinguish metadata, that is, a description of other data.
A similar yet earlier term for metadata is "ancillary data."
The prototypical example of metadata is the library catalog, which is a description of the contents of books.
Meaning of data, information and knowledge
The terms information and knowledge are frequently used for overlapping concepts.
The main difference is in the level of abstraction being considered.
Data is the lowest level of abstraction, information is the next level, and finally, knowledge is the highest level among all three.
For example, the height of Mt. Everest is generally considered as "data", a book on Mt. Everest geological characteristics may be considered as "information", and a report containing practical information on the best way to reach Mt. Everest's peak may be considered as "knowledge".
Information as a concept bears a diversity of meanings, from everyday usage to technical settings.
Generally speaking, the concept of information is closely related to notions of constraint, communication, control, data, form, instruction, knowledge, meaning, mental stimulus, pattern, perception, and representation.
Beynon-Davies uses the concept of a sign to distinguish between data and information.
Data are symbols.
Information occurs when symbols are used to refer to something.
Data analysis
Data analysis is the process of looking at and summarizing data with the intent to extract useful information and develop conclusions.
Data analysis is closely related to data mining, but data mining tends to focus on larger data sets, with less emphasis on making inference, and often uses data that was originally collected for a different purpose.
In statistical applications, some people divide data analysis into descriptive statistics, exploratory data analysis (EDA) and confirmatory data analysis (CDA), where EDA focuses on discovering new features in the data, and CDA on confirming or falsifying existing hypotheses.
Data analysis assumes different aspects, and possibly different names, in different fields.
The term data analysis is also used as a synonym for data modeling, which is unrelated to the subject of this article.
Nuclear and particle physics
In nuclear and particle physics the data usually originate from the experimental apparatus via a data acquisition system.
They are then processed, in a step usually called data reduction, to apply calibrations and to extract physically significant information.
Data reduction is most often, especially in large particle physics experiments, an automatic, batch-mode operation carried out by software written ad hoc.
The resulting data n-tuples are then scrutinized by the physicists, using specialized software tools like ROOT or PAW, comparing the results of the experiment with theory.
The theoretical models are often difficult to compare directly with the results of the experiments, so they are used instead as input for Monte Carlo simulation software such as Geant4, which predicts the response of the detector to a given theoretical event, producing simulated events that are then compared to experimental data.
See also: Computational physics.
Social sciences
Qualitative data analysis (QDA) or qualitative research is the analysis of non-numerical data, for example words, photographs, observations, etc.
Information technology
A special case is the data analysis in information technology audits.
Business
See
Analytics
Business intelligence
Data mining
Database
A database is a structured collection of records or data.
A computer database relies upon software to organize the storage of data.
The software models the database structure in what are known as database models.
The model in most common use today is the relational model.
Other models such as the hierarchical model and the network model use a more explicit representation of relationships (see below for explanation of the various database models).
Database management systems (DBMS) are the software used to organize and maintain the database.
These are categorized according to the database model that they support.
The model tends to determine the query languages that are available to access the database.
A great deal of the internal engineering of a DBMS, however, is independent of the data model, and is concerned with managing factors such as performance, concurrency, integrity, and recovery from hardware failures.
In these areas there are large differences between products.
History
The earliest known use of the term data base was in November 1963, when the System Development Corporation sponsored a symposium under the title Development and Management of a Computer-centered Data Base.
Database as a single word became common in Europe in the early 1970s and by the end of the decade it was being used in major American newspapers.
(The abbreviation DB, however, survives.)
The first database management systems were developed in the 1960s.
A pioneer in the field was Charles Bachman.
Bachman's early papers show that his aim was to make more effective use of the new direct access storage devices becoming available: until then, data processing had been based on punched cards and magnetic tape, so that serial processing was the dominant activity.
Two key data models arose at this time: CODASYL developed the network model based on Bachman's ideas, and (apparently independently) the hierarchical model was used in a system developed by North American Rockwell later adopted by IBM as the cornerstone of their IMS product.
While IMS along with the CODASYL IDMS were the big, high visibility databases developed in the 1960s, several others were also born in that decade, some of which have a significant installed base today.
Two worthy of mention are the PICK and MUMPS databases, with the former developed originally as an operating system with an embedded database and the latter as a programming language and database for the development of healthcare systems.
The relational model was proposed by E. F. Codd in 1970.
He criticized existing models for confusing the abstract description of information structure with descriptions of physical access mechanisms.
For a long while, however, the relational model remained of academic interest only.
While CODASYL network model products (IDMS) and hierarchical model products (IMS) were conceived as practical engineering solutions taking account of the technology as it existed at the time, the relational model took a much more theoretical perspective, arguing (correctly) that hardware and software technology would catch up in time.
Among the first implementations were Michael Stonebraker's Ingres at Berkeley, and the System R project at IBM.
Both of these were research prototypes, announced during 1976.
The first commercial products, Oracle and DB2, did not appear until around 1980.
The first successful database product for microcomputers was dBASE for the CP/M and PC-DOS/MS-DOS operating systems.
During the 1980s, research activity focused on distributed database systems and database machines.
Another important theoretical idea was the Functional Data Model, but apart from some specialized applications in genetics, molecular biology, and fraud investigation, the world took little notice.
In the 1990s, attention shifted to object-oriented databases.
These had some success in fields where it was necessary to handle more complex data than relational systems could easily cope with, such as spatial databases, engineering data (including software repositories), and multimedia data.
Some of these ideas were adopted by the relational vendors, who integrated new features into their products as a result.
The 1990s also saw the spread of Open Source databases, such as PostgreSQL and MySQL.
In the 2000s, the fashionable area for innovation is the XML database.
As with object databases, this has spawned a new collection of start-up companies, but at the same time the key ideas are being integrated into the established relational products.
XML databases aim to remove the traditional divide between documents and data, allowing all of an organization's information resources to be held in one place, whether they are highly structured or not.
Database models
Various techniques are used to model data structure.
Most database systems are built around one particular data model, although it is increasingly common for products to offer support for more than one model.
For any one logical model various physical implementations may be possible, and most products will offer the user some level of control in tuning the physical implementation, since the choices that are made have a significant effect on performance.
Here are three examples:
Hierarchical model
In a hierarchical model, data is organized into an inverted tree-like structure, with multiple downward links in each node to describe the nesting, and a sort field to keep the records in a particular order within each same-level list.
This structure arranges the various data elements in a hierarchy and helps to establish logical relationships among data elements of multiple files.
Each unit in the model is a record which is also known as a node.
In such a model, each record on one level can be related to multiple records on the next lower level.
A record that has subsidiary records is called a parent and the subsidiary records are called children.
Data elements in this model are well suited for one-to-many relationships with other data elements in the database.
This model is advantageous when the data elements are inherently hierarchical.
The disadvantage is that in order to prepare the database it becomes necessary to identify the requisite groups of files that are to be logically integrated.
Hence, a hierarchical data model may not always be flexible enough to accommodate the dynamic needs of an organization.
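The parent and child structure described above can be sketched in a few lines of Python. This is an illustration only, not how any particular hierarchical DBMS stores its records: the `Record` class, its method names, and the sample data are hypothetical, and the record name doubles as the sort field.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple


@dataclass
class Record:
    """A node in the hierarchy: exactly one parent, any number of children."""
    name: str
    parent: Optional["Record"] = field(default=None, repr=False, compare=False)
    children: List["Record"] = field(default_factory=list)

    def add_child(self, name: str) -> "Record":
        child = Record(name, parent=self)
        self.children.append(child)
        # Keep each same-level list ordered by the sort field (here, the name).
        self.children.sort(key=lambda record: record.name)
        return child

    def walk(self, depth: int = 0) -> Iterator[Tuple[int, str]]:
        """Yield (depth, name) pairs from the top of the tree downward."""
        yield depth, self.name
        for child in self.children:
            yield from child.walk(depth + 1)


company = Record("Company")
sales = company.add_child("Sales")
engineering = company.add_child("Engineering")
alice = engineering.add_child("Alice")
bob = sales.add_child("Bob")

outline = list(company.walk())
```

Because every record stores a single `parent`, a one-to-many relationship is easy to express, while a record that belongs to two groups cannot be represented without duplicating it.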
Network model
The network model tends to store records with links to other records.
Each record in the database can have multiple parents, i.e., the relationships among data elements can be many-to-many.
Associations are tracked via "pointers".
These pointers can be node numbers or disk addresses.
Most network databases tend to also include some form of hierarchical model.
Databases can be translated from hierarchical model to network and vice versa.
The main difference between the network model and hierarchical model is that in a network model, a child can have a number of parents whereas in a hierarchical model, a child can have only one parent.
The network model has an advantage over the hierarchical model in that it promotes greater flexibility and data accessibility, since records at a lower level can be accessed without accessing the records above them.
This model is more efficient than the hierarchical model, easier to understand, and can be applied to many real-world problems that require routine transactions.
The disadvantages are that it is a complex process to design and develop a network database; it has to be refined frequently; it requires that the relationships among all the records be defined before development starts, and changes often demand major programming efforts; and operation and maintenance of the network model are expensive and time-consuming.
Examples of database engines that have network model capabilities are RDM Embedded and RDM Server.
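The difference from the hierarchical model, a child with more than one parent, can be sketched with node numbers acting as the "pointers". The record contents, the `link` helper and the two lookup tables are hypothetical and stand in for whatever structures a real network database would use.

```python
from collections import defaultdict

# Records are addressed by node number; the numbers play the role of pointers.
records = {
    1: "Supplier A",
    2: "Supplier B",
    3: "Part: bolt",
    4: "Part: nut",
}

owners = defaultdict(set)   # child node number  -> parent node numbers
members = defaultdict(set)  # parent node number -> child node numbers


def link(parent: int, child: int) -> None:
    """Record an association in both directions."""
    members[parent].add(child)
    owners[child].add(parent)


link(1, 3)  # Supplier A supplies bolts
link(2, 3)  # Supplier B also supplies bolts: the bolt record has two parents
link(2, 4)  # Supplier B supplies nuts

parents_of_bolt = sorted(records[p] for p in owners[3])
parts_of_supplier_b = sorted(records[c] for c in members[2])
```

The bolt record is reached directly through its node number, without first visiting either supplier, which is the accessibility advantage noted above.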
Relational model
The basic data structure of the relational model is a table where information about a particular entity (say, an employee) is represented in columns and rows.
The columns enumerate the various attributes of an entity (e.g. employee_name, address, phone_number).
Rows (also called records) represent instances of an entity (e.g. specific employees).
The "relation" in "relational database" comes from the mathematical notion of relations from the field of set theory.
A relation is a set of tuples, so rows are sometimes called tuples.
All tables in a relational database adhere to three basic rules:
The ordering of columns is immaterial.
Identical rows are not allowed in a table.
Each row has a single (separate) value for each of its columns (each tuple has an atomic value).
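These three rules follow naturally if a relation is treated as a set of tuples. The Python sketch below is one possible way to model that, assuming each row is a frozen set of (attribute, value) pairs; `make_row` and the sample values are hypothetical.

```python
def make_row(**attributes):
    """Build a row as an unordered, hashable set of (attribute, value) pairs."""
    return frozenset(attributes.items())


relation = set()
relation.add(make_row(employee_name="Ann", phone_number="555-0100"))
# Same attributes given in a different column order: it is the same row,
# so the set silently refuses to store a duplicate.
relation.add(make_row(phone_number="555-0100", employee_name="Ann"))
relation.add(make_row(employee_name="Raj", phone_number="555-0101"))

row_count = len(relation)
names = sorted(dict(row)["employee_name"] for row in relation)
```

Each attribute maps to one scalar value per row, which corresponds to the requirement that values be atomic.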
If the same value occurs in two different records (from the same table or different tables) it can imply a relationship between those records.
Relationships between records are often categorized by their cardinality (1:1, (0), 1:M, M:M).
Tables can have a designated column or set of columns that act as a "key" to select rows from that table with the same or similar key values.
A "primary key" is a key that has a unique value for each row in the table.
Keys are commonly used to join or combine data from two or more tables.
For example, an employee table may contain a column named address which contains a value that matches the key of an address table.
Keys are also critical in the creation of indexes, which facilitate fast retrieval of data from large tables.
It is not necessary to define all the keys in advance; a column can be used as a key even if it was not originally intended to be one.
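The employee and address example can be tried out with Python's built-in `sqlite3` module. This is a sketch under stated assumptions: the table layout, column names and sample rows are invented for illustration, and SQLite is used only because it needs no server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE address (
        address_id INTEGER PRIMARY KEY,   -- primary key: unique per row
        city       TEXT NOT NULL
    );
    CREATE TABLE employee (
        employee_id   INTEGER PRIMARY KEY,
        employee_name TEXT NOT NULL,
        address       INTEGER REFERENCES address(address_id)
    );
    INSERT INTO address  VALUES (1, 'Austin'), (2, 'Boston');
    INSERT INTO employee VALUES (10, 'Ann', 1), (11, 'Raj', 2), (12, 'Lee', 1);
""")

# Join the two tables where employee.address matches the address key.
joined = conn.execute("""
    SELECT e.employee_name, a.city
    FROM employee AS e
    JOIN address AS a ON e.address = a.address_id
    ORDER BY e.employee_name
""").fetchall()

# A second row with an existing primary key value is rejected.
try:
    conn.execute("INSERT INTO address VALUES (1, 'Chicago')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

The join condition uses only matching values in the two columns; no pointer from one table to the other is stored.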
Relational operations
Users (or programs) request data from a relational database by sending it a query that is written in a special language, usually a dialect of SQL.
Although SQL was originally intended for end-users, it is much more common for SQL queries to be embedded into software that provides an easier user interface.
Many web applications, such as Wikipedia, perform SQL queries when generating pages.
In response to a query, the database returns a result set, which is the list of rows constituting the answer.
The simplest query is just to return all the rows from a table, but more often, the rows are filtered in some way to return just the answer wanted.
Often, data from multiple tables are combined into one, by doing a join.
There are a number of relational operations in addition to join.
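As a small illustration of a result set, the sketch below runs the simplest query (all rows) and a filtered query against an in-memory SQLite database from Python. The `employee` table, its `department` column and the sample rows are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (employee_name TEXT, department TEXT)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?)",
    [("Ann", "Sales"), ("Raj", "Engineering"), ("Lee", "Sales")],
)

# The simplest query: return every row of the table.
all_rows = conn.execute(
    "SELECT employee_name, department FROM employee ORDER BY employee_name"
).fetchall()

# A filtered query: only the rows that answer the question being asked.
sales_rows = conn.execute(
    "SELECT employee_name FROM employee WHERE department = ? ORDER BY employee_name",
    ("Sales",),
).fetchall()
```

Each result set comes back as a list of rows, here represented as Python tuples.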
Normal forms
Relations are classified based upon the types of anomalies to which they are vulnerable.
A database that is in the first normal form is vulnerable to all types of anomalies, while a database that is in the domain/key normal form has no modification anomalies.
Normal forms are hierarchical in nature.
That is, the lowest level is the first normal form, and the database cannot meet the requirements for higher level normal forms without first having met all the requirements of the lesser normal form.
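A modification anomaly is easier to see with a concrete case. The following Python sketch uses plain dictionaries rather than a real DBMS; the employee and department data are hypothetical. In the flat layout a department's city is repeated on every employee row, so a partial update leaves the data contradicting itself; in the split layout the city is stored once.

```python
# Flat layout: the department's city is repeated on every employee row.
employees_flat = [
    {"name": "Ann", "dept": "Sales", "dept_city": "Austin"},
    {"name": "Lee", "dept": "Sales", "dept_city": "Austin"},
]

# The department moves, but only one of the rows is updated.
employees_flat[0]["dept_city"] = "Boston"
cities_flat = {row["dept_city"] for row in employees_flat if row["dept"] == "Sales"}
# cities_flat now holds two conflicting answers for the same department.

# Split layout: the city is a fact about the department and is stored once.
departments = {"Sales": {"city": "Austin"}}
employees = [
    {"name": "Ann", "dept": "Sales"},
    {"name": "Lee", "dept": "Sales"},
]

departments["Sales"]["city"] = "Boston"  # a single update suffices
cities_split = {departments[e["dept"]]["city"] for e in employees}
```

Moving to a higher normal form amounts to this kind of decomposition, at the price of a lookup (a join, in a relational system) when the two facts are needed together.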
Database Management Systems
Relational database management systems
An RDBMS implements the features of the relational model outlined above.
In this context, Date's Information Principle states:
The entire information content of the database is represented in one and only one way, namely as explicit values in column positions (attributes) and rows in relations (tuples).
Therefore, there are no explicit pointers between related tables.
Post-relational database models
Several products have been identified as post-relational because the data model incorporates relations but is not constrained by the Information Principle, which requires that all information be represented by data values in relations.
Products using a post-relational data model typically employ a model that actually pre-dates the relational model.
These might be identified as a directed graph with trees on the nodes.
Examples of models that could be classified as post-relational are PICK aka MultiValue, and MUMPS.
Object database models
In recent years, the object-oriented paradigm has been applied to database technology, creating a new programming model known as object databases.
These databases attempt to bring the database world and the application programming world closer together, in particular by ensuring that the database uses the same type system as the application program.
This aims to avoid the overhead (sometimes referred to as the impedance mismatch) of converting information between its representation in the database (for example as rows in tables) and its representation in the application program (typically as objects).
At the same time, object databases attempt to introduce the key ideas of object programming, such as encapsulation and polymorphism, into the world of databases.
A variety of ways have been tried for storing objects in a database.
Some products have approached the problem from the application programming end, by making the objects manipulated by the program persistent.
This also typically requires the addition of some kind of query language, since conventional programming languages do not have the ability to find objects based on their information content.
Others have attacked the problem from the database end, by defining an object-oriented data model for the database, and defining a database programming language that allows full programming capabilities as well as traditional query facilities.
DBMS internals
Storage and physical database design
Database tables and indexes are typically stored in memory or on hard disk in one of many forms: ordered or unordered flat files, ISAM, heaps, hash buckets or B+ trees.
These have various advantages and disadvantages discussed further in the main article on this topic.
The most commonly used are B+ trees and ISAM.
Other important design choices relate to the clustering of data by category (such as grouping data by month, or location), creating pre-computed views known as materialized views, and partitioning data by range or hash.
Memory management and storage topology can also be important design choices for database designers.
Just as normalization is used to reduce storage requirements and improve the extensibility of the database, conversely denormalization is often used to reduce join complexity and reduce execution time for queries.
Indexing
All of these databases can take advantage of indexing to increase their speed.
This technology has advanced tremendously since its early uses in the 1960s and 1970s.
The most common kind of index is a sorted list of the contents of some particular table column, with pointers to the row associated with the value.
An index allows a set of table rows matching some criterion to be located quickly.
Typically, indexes are also stored in the various forms of data-structure mentioned above (such as B-trees, hashes, and linked lists).
Usually, a specific technique is chosen by the database designer to increase efficiency in the particular case of the type of index required.
Relational DBMS's have the advantage that indexes can be created or dropped without changing existing applications that make use of them.
The database chooses between many different strategies based on which one it estimates will run the fastest.
In other words, indexes are transparent to the application or end-user querying the database; while they affect performance, any SQL command will run with or without index to compute the result of an SQL statement.
The RDBMS will produce a plan of how to execute the query, which is generated by estimating the run times of the different algorithms and selecting the quickest.
Some of the key algorithms that deal with joins are nested loop join, sort-merge join and hash join.
Which of these is chosen depends on whether an index exists, what type it is, and its cardinality.
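The transparency of indexes can be observed with SQLite from Python: the same query returns the same rows before and after an index is created, and only the reported plan changes. This is a sketch; the table, the index name `idx_employee_name` and the helper `plan` are hypothetical, and the exact wording of SQLite's `EXPLAIN QUERY PLAN` output differs between versions, so the code only looks for the index name in it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee (employee_id INTEGER PRIMARY KEY, employee_name TEXT)"
)
conn.executemany(
    "INSERT INTO employee (employee_name) VALUES (?)",
    [(f"name{i}",) for i in range(1000)],
)

query = "SELECT employee_id FROM employee WHERE employee_name = ?"


def plan(sql, params):
    """Return the human-readable plan text SQLite reports for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The description is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)


result_without_index = conn.execute(query, ("name42",)).fetchall()
plan_without_index = plan(query, ("name42",))

conn.execute("CREATE INDEX idx_employee_name ON employee (employee_name)")

result_with_index = conn.execute(query, ("name42",)).fetchall()
plan_with_index = plan(query, ("name42",))

same_answer = result_without_index == result_with_index
index_used = "idx_employee_name" in plan_with_index
```

The application code issuing `query` did not change at all; the decision to use the new index was made by the database.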
An index speeds up access to data, but it has disadvantages as well.
First, every index increases the amount of storage on the hard drive necessary for the database file, and second, the index must be updated each time the data are altered, and this costs time.
(Thus an index saves time in the reading of data, but it costs time in entering and altering data.
It thus depends on the use to which the data are to be put whether an index is on the whole a net plus or minus in the quest for efficiency.)
A special case of an index is a primary index, or primary key, which is distinguished in that the primary index must ensure a unique reference to a record.
Often, for this purpose one simply uses a running index number (ID number).
Primary indexes play a significant role in relational databases, and they can speed up access to data considerably.
Transactions and concurrency
In addition to their data model, most practical databases ("transactional databases") attempt to enforce database transactions.
Ideally, the database software should enforce the ACID rules, summarized here:
Atomicity: Either all the tasks in a transaction must be done, or none of them.
The transaction must be completed, or else it must be undone (rolled back).
Consistency: Every transaction must preserve the integrity constraints — the declared consistency rules — of the database.
It cannot place the data in a contradictory state.
Isolation: Two simultaneous transactions cannot interfere with one another.
Intermediate results within a transaction are not visible to other transactions.
Durability: Completed transactions cannot be aborted later or their results discarded.
They must persist through (for instance) restarts of the DBMS after crashes.
In practice, many DBMS's allow most of these rules to be selectively relaxed for better performance.
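Atomicity in particular can be demonstrated with a short Python sketch using the built-in `sqlite3` module. The `account` table, the `transfer` function and the balances are hypothetical; the `CHECK` constraint stands in for a declared consistency rule. When the second update of a transfer violates the rule, the first update is rolled back as well.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    "  name TEXT PRIMARY KEY,"
    "  balance INTEGER CHECK (balance >= 0)"   # declared consistency rule
    ")"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(amount: int) -> bool:
    """Move `amount` from alice to bob as one all-or-nothing transaction."""
    try:
        # Used as a context manager, the connection commits if the block
        # succeeds and rolls back if it raises an exception.
        with conn:
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = 'bob'",
                (amount,),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = 'alice'",
                (amount,),
            )
        return True
    except sqlite3.IntegrityError:
        return False


first_ok = transfer(30)     # succeeds: alice 70, bob 80
second_ok = transfer(500)   # would overdraw alice, so bob's credit is undone too
balances = dict(conn.execute("SELECT name, balance FROM account"))
```

After the failed transfer the balances are exactly what they were before it started, which is the behaviour atomicity requires.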
Concurrency control is a method used to ensure that transactions are executed in a safe manner and follow the ACID rules.
The DBMS must be able to ensure that only serializable, recoverable schedules are allowed, and that no actions of committed transactions are lost while undoing aborted transactions.
Replication
Replication of databases is closely related to transactions.
If a database can log its individual actions, it is possible to create a duplicate of the data in real time.
The duplicate can be used to improve performance or availability of the whole database system.
Common replication concepts include:
Master/Slave Replication: All write requests are performed on the master and then replicated to the slaves.
Quorum: The results of read and write requests are determined by querying a "majority" of replicas.
Multimaster: Two or more replicas sync each other via a transaction identifier.
Parallel synchronous replication of databases enables transactions to be replicated on multiple servers simultaneously, which provides a method for backup and security as well as data availability.
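The idea that a log of individual actions is enough to build a duplicate can be sketched as a toy master/slave arrangement in Python. This is not the protocol of any real DBMS: the `Master` and `Replica` classes, the key/value data and the in-process "shipping" of the log are all simplifying assumptions.

```python
class Replica:
    """A copy that rebuilds its state purely by replaying the master's log."""

    def __init__(self):
        self.data = {}
        self.applied = 0  # how many log entries have been replayed so far

    def apply(self, log):
        # Replay only the entries this replica has not seen yet.
        for key, value in log[self.applied:]:
            self.data[key] = value
        self.applied = len(log)


class Master:
    """Accepts all writes, logs them, and forwards the log to the replicas."""

    def __init__(self, replicas):
        self.data = {}
        self.log = []
        self.replicas = replicas

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))
        for replica in self.replicas:
            replica.apply(self.log)


replica_a, replica_b = Replica(), Replica()
master = Master([replica_a, replica_b])
master.write("x", 1)
master.write("y", 2)
master.write("x", 3)

# A replica that joins later can catch up from the same log.
late_replica = Replica()
late_replica.apply(master.log)
```

Reads could then be served from any replica, which is how the duplicate improves performance or availability in this simplified picture.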
Security
Database security denotes the system, processes, and procedures that protect a database from unintended activity.
Security is usually enforced through access control, auditing, and encryption.
Access control ensures and restricts who can connect and what can be done to the database.
Auditing logs what action or change has been performed, when and by whom.
Encryption: Since security has become a major issue in recent years, many commercial database vendors provide built-in encryption mechanisms.
Data is encoded natively into the tables and deciphered "on the fly" when a query comes in.
Connections can also be secured and encrypted if required, using SSL, mechanisms such as DSA and MD5 (which provide digital signatures and hashing rather than encryption), or a legacy encryption standard.
Enforcing security is one of the major tasks of the DBA.
In the United Kingdom, legislation protecting the public from unauthorized disclosure of personal information held on databases falls under the Office of the Information Commissioner.
United Kingdom based organizations holding personal data in electronic format (databases for example) are required to register with the Data Commissioner.
Locking
Locking is how the database handles multiple concurrent operations.
This is how concurrency and a basic form of integrity are managed within the database system.
Such locks can be applied at the row level, or at other levels such as a page (a basic data block), an extent (an array of multiple pages) or even an entire table.
This helps maintain the integrity of the data by ensuring that only one process at a time can modify the same data.
Unlike a basic filesystem's files or folders, where only one lock at a time can be set, restricting usage to one process only, a database can set and hold multiple locks at the same time on different levels of the physical data structure.
How locks are set and how long they last is determined by the database engine's locking scheme, based on the SQL statements or transactions submitted by users.
Generally speaking, no activity on the database should translate into no locking or very light locking.
For most DBMSs on the market, locks are generally either shared or exclusive.
Exclusive locks mean that no other lock can acquire the current data object as long as the exclusive lock lasts.
Exclusive locks are usually set while the database needs to change data, like during an UPDATE or DELETE operation.
Shared locks, by contrast, can coexist: several of them can be held on the same data structure at the same time.
Shared locks are usually used while the database is reading data, during a SELECT operation.
The number and nature of locks, and the time for which a lock holds a data block, can have a huge impact on database performance.
Bad locking can lead to disastrous performance (usually the result of poor SQL requests or an inadequate physical database structure).
Default locking behavior is enforced by the isolation level of the data server.
Changing the isolation level will affect how shared or exclusive locks must be set on the data for the entire database system.
The default isolation level is generally 1, where data cannot be read while it is being modified, which prevents "ghost data" from being returned to the end user.
At some point, intensive or inappropriate exclusive locking can lead to a "deadlock" situation between two locks, where neither lock can be released because each is trying to acquire resources held by the other.
The database has a fail-safe mechanism and will automatically "sacrifice" one of the locks, releasing the resource.
In doing so, processes or transactions involved in the deadlock will be rolled back.
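One common way to picture deadlock detection is a "wait-for" graph, in which each blocked transaction points to the transaction holding the lock it needs; a cycle in that graph is a deadlock. The following is a minimal sketch under simplifying assumptions that are not taken from the text above: each transaction waits for at most one other, and the victim is chosen as the most recently started transaction. The names find_deadlock and choose_victim are hypothetical, and real database engines use their own policies.

```python
def find_deadlock(waits_for):
    """Return the transactions forming a cycle in the wait-for graph, or None.

    `waits_for` maps a blocked transaction to the one holding the lock it needs
    (assumption: each transaction waits for at most one other).
    """
    for start in waits_for:
        seen = []
        node = start
        # Follow the chain of "waits for" edges until it ends or repeats.
        while node in waits_for and node not in seen:
            seen.append(node)
            node = waits_for[node]
        if node in seen:
            # The chain came back to an earlier transaction: that is a cycle.
            return seen[seen.index(node):]
    return None

def choose_victim(cycle, started_at):
    """Pick the transaction to roll back: here, the most recently started one."""
    return max(cycle, key=lambda txn: started_at[txn])

# Hypothetical situation: T1 and T2 block each other, T3 waits behind T1.
waits_for = {"T1": "T2", "T2": "T1", "T3": "T1"}
started_at = {"T1": 1, "T2": 5, "T3": 9}
cycle = find_deadlock(waits_for)
victim = choose_victim(cycle, started_at)
```

Rolling back the victim releases its locks, which breaks the cycle and lets the remaining transactions proceed.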
Databases can also be locked for other reasons, like access restrictions for given levels of user.
Databases are also locked for routine database maintenance, which prevents changes being made during the maintenance.
(See IBM for more detail.)
Architecture
Depending on the intended use, there are a number of database architectures in use.
Many databases use a combination of strategies.
On-line Transaction Processing (OLTP) systems often use a row-oriented datastore architecture, while data-warehouse and other retrieval-focused applications such as Google's BigTable, or bibliographic database (library catalogue) systems, may use a column-oriented datastore architecture.
Document-oriented databases, XML databases and knowledge bases, as well as frame databases and RDF stores (also known as triple stores), may also use a combination of these architectures in their implementation.
Finally, it should be noted that not all databases have or need a database 'schema' (so-called schema-less databases).
Applications of databases
Databases are used in many applications, spanning virtually the entire range of computer software.
Databases are the preferred method of storage for large multiuser applications, where coordination between many users is needed.
Even individual users find them convenient, and many electronic mail programs and personal organizers are based on standard database technology.
Software database drivers are available for most database platforms so that application software can use a common Application Programming Interface to retrieve the information stored in a database.
Two commonly used database APIs are JDBC and ODBC.
For example, a suppliers database contains data relating to suppliers, such as:
supplier name
supplier code
supplier address
It is often used by schools to teach students and grade them.
Links to DBMS products
4D
ADABAS
Alpha Five
Apache Derby (Java, also known as IBM Cloudscape and Sun Java DB)
BerkeleyDB
CouchDB
CSQL
Datawasp
Db4objects
dBase
FileMaker
Firebird (database server)
H2 (Java)
Hsqldb (Java)
IBM DB2
IBM IMS (Information Management System)
IBM UniVerse
Informix
Ingres
Interbase
InterSystems Caché
MaxDB (formerly SapDB)
Microsoft Access
Microsoft SQL Server
Model 204
MySQL
Nomad
Objectivity/DB
ObjectStore
OpenLink Virtuoso
OpenOffice.org Base
Oracle Database
Paradox (database)
Polyhedra DBMS
PostgreSQL
Progress 4GL
RDM Embedded
ScimoreDB
Sedna
SQLite
Superbase
Sybase
Teradata
Vertica
Visual FoxPro
Cluster analysis
Clustering is the classification of objects into different groups, or more precisely, the partitioning of a data set into subsets (clusters), so that the data in each subset (ideally) share some common trait - often proximity according to some defined distance measure.
Data clustering is a common technique for statistical data analysis, which is used in many fields, including machine learning, data mining, pattern recognition, image analysis and bioinformatics.
The computational task of classifying the data set into k clusters is often referred to as k-clustering.
Besides the term data clustering (or just clustering), there are a number of terms with similar meanings, including cluster analysis, automatic classification, numerical taxonomy, botryology and typological analysis.
Types of clustering
Data clustering algorithms can be hierarchical or partitional.
Hierarchical algorithms find successive clusters using previously established clusters.
Hierarchical algorithms can be agglomerative ("bottom-up") or divisive ("top-down").
Agglomerative algorithms begin with each element as a separate cluster and merge them into successively larger clusters.
Divisive algorithms begin with the whole set and proceed to divide it into successively smaller clusters.
Partitional algorithms typically determine all clusters at once, but can also be used as divisive algorithms in the hierarchical clustering.
Two-way clustering, co-clustering or biclustering are clustering methods where not only the objects are clustered but also the features of the objects, i.e., if the data is represented in a data matrix, the rows and columns are clustered simultaneously.
Another important distinction is whether the clustering uses symmetric or asymmetric distances.
A property of Euclidean space is that distances are symmetric (the distance from object A to B is the same as the distance from B to A).
In other applications (e.g., sequence-alignment methods, see Prinzie & Van den Poel (2006)), this is not the case.
Distance measure
An important step in any clustering is to select a distance measure, which will determine how the similarity of two elements is calculated.
This will influence the shape of the clusters, as some elements may be close to one another according to one distance and further away according to another.
For example, in a 2-dimensional space, the distance between the point (x=1, y=0) and the origin (x=0, y=0) is always 1 according to the usual norms, but the distance between the point (x=1, y=1) and the origin can be 2, $\sqrt{2}$ or 1 if you take respectively the 1-norm, 2-norm or infinity-norm distance.
Common distance functions:
The Euclidean distance (also called distance as the crow flies or 2-norm distance).
A review of cluster analysis in health psychology research found that the most common distance measure in published studies in that research area is the Euclidean distance or the squared Euclidean distance.
The Manhattan distance (also called taxicab norm or 1-norm)
The maximum norm
The Mahalanobis distance corrects data for different scales and correlations in the variables
The angle between two vectors can be used as a distance measure when clustering high dimensional data.
See Inner product space.
The Hamming distance (sometimes edit distance) measures the minimum number of substitutions required to change one member into another.
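Several of the distance functions listed above can be written in a few lines. The following is a minimal sketch using only the Python standard library; the function names are chosen for illustration, and points are assumed to be tuples of equal length.

```python
import math

def manhattan(p, q):
    """1-norm (taxicab) distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

def euclidean(p, q):
    """2-norm distance: square root of the sum of squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def maximum_norm(p, q):
    """Infinity-norm distance: largest absolute coordinate difference."""
    return max(abs(a - b) for a, b in zip(p, q))

def hamming(s, t):
    """Number of positions at which two equal-length sequences differ."""
    if len(s) != len(t):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(s, t))

origin = (0, 0)
# The point (1, 1) is at a different distance from the origin under each norm.
distances = (manhattan((1, 1), origin),
             euclidean((1, 1), origin),
             maximum_norm((1, 1), origin))
```

For the point (1, 1), the three values are 2, the square root of 2, and 1, which matches the example given in the section on distance measures, while for (1, 0) all three functions return 1.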
Hierarchical clustering
Creating clusters
Hierarchical clustering builds (agglomerative), or breaks up (divisive), a hierarchy of clusters.
The traditional representation of this hierarchy is a tree (called a dendrogram), with individual elements at one end and a single cluster containing every element at the other.
Agglomerative algorithms begin at the leaves of the tree, whereas divisive algorithms begin at the root.
(In the figure, the arrows indicate an agglomerative clustering.)
Cutting the tree at a given height will give a clustering at a selected precision.
In the following example, cutting after the second row will yield clusters {a} {b c} {d e} {f}.
Cutting after the third row will yield clusters {a} {b c} {d e f}, which is a coarser clustering, with a smaller number of larger clusters.
Agglomerative hierarchical clustering
For example, suppose this data is to be clustered, and the euclidean distance is the distance metric.
The hierarchical clustering dendrogram would be as such:
This method builds the hierarchy from the individual elements by progressively merging clusters.
In our example, we have six elements {a} {b} {c} {d} {e} and {f}.
The first step is to determine which elements to merge in a cluster.
Usually, we want to take the two closest elements, according to the chosen distance.
Optionally, one can also construct a distance matrix at this stage, where the number in the i-th row j-th column is the distance between the i-th and j-th elements.
Then, as clustering progresses, rows and columns are merged as the clusters are merged and the distances updated.
This is a common way to implement this type of clustering, and has the benefit of caching distances between clusters.
A simple agglomerative clustering algorithm is described in the single linkage clustering page; it can easily be adapted to different types of linkage (see below).
Suppose we have merged the two closest elements b and c, we now have the following clusters {a}, {b, c}, {d}, {e} and {f}, and want to merge them further.
To do that, we need to take the distance between {a} and {b c}, and therefore define the distance between two clusters.
Usually the distance between two clusters $A$ and $B$ is one of the following:
The maximum distance between elements of each cluster (also called complete linkage clustering): $\max \{ d(x, y) : x \in A,\ y \in B \}$
The minimum distance between elements of each cluster (also called single linkage clustering): $\min \{ d(x, y) : x \in A,\ y \in B \}$
The mean distance between elements of each cluster (also called average linkage clustering, used e.g. in UPGMA): $\frac{1}{|A| \cdot |B|} \sum_{x \in A} \sum_{y \in B} d(x, y)$
The sum of all intra-cluster variance
The increase in variance for the cluster being merged (Ward's criterion)
The probability that candidate clusters spawn from the same distribution function (V-linkage)
Each agglomeration occurs at a greater distance between clusters than the previous agglomeration, and one can decide to stop clustering either when the clusters are too far apart to be merged (distance criterion) or when there is a sufficiently small number of clusters (number criterion).
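The merging procedure can be sketched as follows. This is a deliberately naive illustration, not an efficient implementation: it recomputes all cluster distances at every step instead of maintaining a distance matrix, it uses the Euclidean distance, and it stops by the number criterion. The names agglomerate and LINKAGES are hypothetical.

```python
import math
from itertools import combinations

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Each linkage reduces the list of pairwise distances between two clusters.
LINKAGES = {
    "single": min,                              # minimum pairwise distance
    "complete": max,                            # maximum pairwise distance
    "average": lambda ds: sum(ds) / len(ds),    # mean pairwise distance
}

def cluster_distance(a, b, linkage):
    return LINKAGES[linkage]([euclidean(p, q) for p in a for q in b])

def agglomerate(points, num_clusters, linkage="single"):
    """Merge the two closest clusters until only `num_clusters` remain."""
    clusters = [[p] for p in points]  # start with every element on its own
    while len(clusters) > num_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_distance(
                clusters[ij[0]], clusters[ij[1]], linkage))
        clusters[i] = clusters[i] + clusters[j]  # i < j, so index i stays valid
        del clusters[j]
    return clusters

# Hypothetical one-dimensional data, written as 1-tuples.
points = [(0,), (1,), (5,), (6,), (20,)]
three = agglomerate(points, 3, linkage="single")
two = agglomerate(points, 2, linkage="complete")
```

Replacing the number criterion with a distance criterion would only require stopping when the smallest cluster distance exceeds a chosen threshold.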
Concept clustering
Another variation of the agglomerative clustering approach is conceptual clustering.
Partitional clustering
K-means and derivatives
K-means clustering
The K-means algorithm assigns each point to the cluster whose center (also called centroid) is nearest.
The center is the average of all the points in the cluster; that is, its coordinates are the arithmetic mean for each dimension separately over all the points in the cluster.
Example: The data set has three dimensions and the cluster has two points: X = (x1, x2, x3) and Y = (y1, y2, y3).
Then the centroid Z becomes Z = (z1, z2, z3), where z1 = (x1 + y1)/2 and z2 = (x2 + y2)/2 and z3 = (x3 + y3)/2.
The algorithm steps are (J. MacQueen, 1967):
Choose the number of clusters, k.
Randomly generate k clusters and determine the cluster centers, or directly generate k random points as cluster centers.
Assign each point to the nearest cluster center.
Recompute the new cluster centers.
Repeat the two previous steps until some convergence criterion is met (usually that the assignment hasn't changed).
The main advantages of this algorithm are its simplicity and speed, which allow it to run on large datasets.
Its disadvantage is that it does not yield the same result with each run, since the resulting clusters depend on the initial random assignments.
It minimizes intra-cluster variance, but does not ensure that the result has a global minimum of variance.
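The steps listed above can be sketched in plain Python. This is a minimal illustration rather than a tuned implementation: it picks k random data points as the initial centers, uses squared Euclidean distance, and stops when the assignment no longer changes. The parameters seed and max_iter are additions for reproducibility and safety, not part of the algorithm as described.

```python
import random

def mean(points):
    """Coordinate-wise arithmetic mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # k random points as initial centers
    assignment = None
    for _ in range(max_iter):
        # Assign each point to the nearest cluster center.
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centers[c]))
            for p in points]
        if new_assignment == assignment:  # convergence: nothing changed
            break
        assignment = new_assignment
        # Recompute each center as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster became empty
                centers[c] = mean(members)
    return centers, assignment

# Hypothetical data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, labels = k_means(points, 2, seed=1)
```

Running the function with different seeds on less clearly separated data is a simple way to observe the dependence on the initial random choice mentioned above.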
Fuzzy c-means clustering
In fuzzy clustering, each point has a degree of belonging to clusters, as in fuzzy logic, rather than belonging completely to just one cluster.
Thus, points on the edge of a cluster may be in the cluster to a lesser degree than points in the center of the cluster.
For each point $x$ we have a coefficient $u_k(x)$ giving the degree of being in the $k$th cluster.
Usually, the sum of those coefficients is defined to be 1: $\forall x \quad \sum_{k} u_k(x) = 1$
With fuzzy c-means, the centroid of a cluster is the mean of all points, weighted by their degree of belonging to the cluster: $\mathrm{center}_k = \frac{\sum_x u_k(x)^m \, x}{\sum_x u_k(x)^m}$
The degree of belonging is related to the inverse of the distance to the cluster center;
the coefficients are then normalized and fuzzified with a real parameter $m > 1$ so that their sum is 1.
So $u_k(x) = \frac{1}{\sum_j \left( d(\mathrm{center}_k, x) / d(\mathrm{center}_j, x) \right)^{2/(m-1)}}$.
For $m$ equal to 2, this is equivalent to normalising the inverse squared distances linearly to make their sum 1.
When $m$ is close to 1, the cluster center closest to the point is given much more weight than the others, and the algorithm is similar to k-means.
The fuzzy c-means algorithm is very similar to the k-means algorithm:
Choose a number of clusters.
Assign randomly to each point coefficients for being in the clusters.
Repeat until the algorithm has converged (that is, the coefficients' change between two iterations is no more than $\varepsilon$, the given sensitivity threshold):
Compute the centroid for each cluster, using the formula above.
For each point, compute its coefficients of being in the clusters, using the formula above.
The algorithm minimizes intra-cluster variance as well, but has the same problems as k-means: the minimum is a local minimum, and the results depend on the initial choice of weights.
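The loop described above can be sketched in plain Python. The sketch assumes the commonly used update rules: centroids are means weighted by the coefficients raised to the power m, and each coefficient is 1 divided by the sum over clusters j of (d_k / d_j) raised to 2/(m-1), where d_k is the distance from the point to center k. The function name, the default values and the handling of a point that coincides with a center are choices made for this illustration.

```python
import random

def dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def fuzzy_c_means(points, c, m=2.0, eps=1e-6, max_iter=200, seed=0):
    rng = random.Random(seed)
    # Random initial coefficients, normalised so each row sums to 1.
    u = []
    for _ in points:
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        u.append([v / total for v in row])
    dims = len(points[0])
    for _ in range(max_iter):
        # Centroid of each cluster: mean of all points weighted by u^m.
        centers = []
        for k in range(c):
            weights = [row[k] ** m for row in u]
            wsum = sum(weights)
            centers.append(tuple(
                sum(w * p[d] for w, p in zip(weights, points)) / wsum
                for d in range(dims)))
        # New coefficients for every point.
        new_u = []
        for p in points:
            ds = [dist(p, ctr) for ctr in centers]
            if min(ds) == 0.0:
                # Point sits exactly on a center: full membership there.
                hits = [1.0 if d == 0.0 else 0.0 for d in ds]
                row = [h / sum(hits) for h in hits]
            else:
                row = [1.0 / sum((ds[k] / ds[j]) ** (2.0 / (m - 1.0))
                                 for j in range(c))
                       for k in range(c)]
            new_u.append(row)
        change = max(abs(a - b)
                     for old, new in zip(u, new_u)
                     for a, b in zip(old, new))
        u = new_u
        if change <= eps:  # converged within the sensitivity threshold
            break
    return centers, u

# Hypothetical one-dimensional data with two obvious groups.
points = [(0.0,), (0.2,), (0.1,), (5.0,), (5.1,), (4.9,)]
centers, u = fuzzy_c_means(points, 2, seed=3)
```

Each row of the returned coefficient table sums to 1, and taking the largest coefficient in each row gives a hard assignment comparable to the output of k-means.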
The Expectation-maximization algorithm is a more statistically formalized method which includes some of these ideas: partial membership in classes.
It has better convergence properties and is in general preferred to fuzzy-c-means.
QT clustering algorithm
QT (quality threshold) clustering (Heyer et al, 1999) is an alternative method of partitioning data, invented for gene clustering.
It requires more computing power than k-means, but does not require specifying the number of clusters a priori, and always returns the same result when run several times.
The algorithm is:
The user chooses a maximum diameter for clusters.
Build a candidate cluster for each point by including the closest point, the next closest, and so on, until the diameter of the cluster surpasses the threshold.
Save the candidate cluster with the most points as the first true cluster, and remove all points in the cluster from further consideration.
This description does not specify what happens if more than one candidate cluster has the maximum number of points.
Recurse with the reduced set of points.
The distance between a point and a group of points is computed using complete linkage, i.e. as the maximum distance from the point to any member of the group (see the "Agglomerative hierarchical clustering" section about distance between clusters).
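The QT procedure can be sketched as follows. This is a minimal illustration under stated assumptions: the points are distinct, the distance function is supplied by the caller, and when several candidate clusters have the same maximal size the first one found is kept (a tie-breaking choice made here, not taken from the description above). The loop replaces the recursion but performs the same steps.

```python
def qt_cluster(points, max_diameter, dist):
    """Quality-threshold clustering; returns a list of clusters (lists)."""
    remaining = list(points)
    clusters = []
    while remaining:
        best = []
        for seed in remaining:
            # Grow a candidate cluster around this point.
            candidate = [seed]
            others = [p for p in remaining if p != seed]
            while others:
                # Complete linkage: a point's distance to the candidate is
                # its maximum distance to any member.
                nxt = min(others,
                          key=lambda p: max(dist(p, q) for q in candidate))
                if max(dist(nxt, q) for q in candidate) > max_diameter:
                    break  # adding it would exceed the allowed diameter
                candidate.append(nxt)
                others.remove(nxt)
            if len(candidate) > len(best):  # ties keep the first candidate
                best = candidate
        clusters.append(best)
        # Remove the clustered points and repeat on the reduced set.
        remaining = [p for p in remaining if p not in best]
    return clusters

# Hypothetical one-dimensional data with absolute difference as distance.
data = [0, 1, 2, 10, 11, 30]
result = qt_cluster(data, 2.5, lambda a, b: abs(a - b))
```

Because no random choice is involved, repeated runs on the same input return the same clusters, at the cost of building a candidate cluster for every remaining point in each round.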
Locality-sensitive hashing
Locality-sensitive hashing can be used for clustering.
Feature space vectors are sets, and the metric used is the Jaccard distance.
The feature space can be considered high-dimensional.
The min-wise independent permutations LSH scheme (sometimes MinHash) is then used to put similar items into buckets.
With just one set of hashing methods, there are only clusters of very similar elements.
By seeding the hash functions several times (e.g. 20), it is possible to get bigger clusters.
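The bucketing idea can be sketched with a simple MinHash signature. The hash family used here, (a*x + b) mod p applied to a CRC32 of each item, is an assumption chosen for illustration and only approximates min-wise independent permutations; the function names are hypothetical.

```python
import random
import zlib

PRIME = (1 << 61) - 1  # a Mersenne prime used as the modulus

def make_hash_functions(count, seed=0):
    """Return `count` seeded hash functions of the form (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    params = [(rng.randrange(1, PRIME), rng.randrange(PRIME))
              for _ in range(count)]

    def make(a, b):
        # CRC32 gives a value that is stable across runs, unlike hash().
        return lambda item: (
            a * zlib.crc32(str(item).encode("utf-8")) + b) % PRIME

    return [make(a, b) for a, b in params]

def minhash_signature(items, hash_functions):
    """One minimum per hash function, taken over all items of the set."""
    return tuple(min(h(item) for item in items) for h in hash_functions)

def jaccard_distance(a, b):
    return 1.0 - len(a & b) / len(a | b)

def estimated_jaccard_distance(sig_a, sig_b):
    """The fraction of differing signature entries estimates the distance."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return 1.0 - matches / len(sig_a)

def bucket(named_sets, hash_functions):
    """Group set names whose signatures are identical."""
    buckets = {}
    for name, items in named_sets.items():
        key = minhash_signature(items, hash_functions)
        buckets.setdefault(key, []).append(name)
    return list(buckets.values())

hashes = make_hash_functions(200, seed=1)
a = set(range(0, 40))
b = set(range(20, 60))
estimate = estimated_jaccard_distance(
    minhash_signature(a, hashes), minhash_signature(b, hashes))
groups = bucket({"x": a, "y": set(a), "z": b}, hashes)
```

Using many hash functions in one signature makes a bucket match very strict, which corresponds to the observation above that a single set of hash functions only groups very similar elements; repeating the bucketing with differently seeded, shorter signatures and merging the matches is one way to obtain larger clusters.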
Graph-theoretic methods
Formal concept analysis is a technique for generating clusters of objects and attributes, given a bipartite graph representing the relations between the objects and attributes.
Other methods for generating overlapping clusters (a cover rather than a partition) are discussed by Jardine and Sibson (1968) and Cole and Wishart (1970).
Elbow criterion
The elbow criterion is a common rule of thumb to determine what number of clusters should be chosen, for example for k-means and agglomerative hierarchical clustering.
It should also be noted that the initial assignment of cluster seeds has bearing on the final model performance.
Thus, it is appropriate to re-run the cluster analysis multiple times.
The elbow criterion says that you should choose a number of clusters so that adding another cluster doesn't add sufficient information.
More precisely, if you graph the percentage of variance explained by the clusters against the number of clusters, the first clusters will add much information (explain a lot of variance), but at some point the marginal gain will drop, giving an angle in the graph (the elbow).
This elbow cannot always be unambiguously identified.
Percentage of variance explained is the ratio of the between-group variance to the total variance.
On the following graph, the elbow is indicated by the red circle.
The number of clusters chosen should therefore be 4.
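The quantity plotted for the elbow criterion can be computed directly from a clustering. The following minimal sketch assumes points given as tuples and one cluster label per point; it uses sums of squared distances, which give the same ratio as the variances because both are divided by the same number of points. The function name explained_variance is hypothetical.

```python
def mean(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def explained_variance(points, labels):
    """Between-group sum of squares divided by the total sum of squares."""
    grand_mean = mean(points)
    total = sum(sq_dist(p, grand_mean) for p in points)
    groups = {}
    for p, label in zip(points, labels):
        groups.setdefault(label, []).append(p)
    between = sum(len(g) * sq_dist(mean(g), grand_mean)
                  for g in groups.values())
    return between / total

# Hypothetical data: two pairs of nearby points on a line.
points = [(0,), (2,), (10,), (12,)]
one_cluster = explained_variance(points, [0, 0, 0, 0])
two_clusters = explained_variance(points, [0, 0, 1, 1])
four_clusters = explained_variance(points, [0, 1, 2, 3])
```

In this small example, going from one cluster to two explains most of the variance, while going from two to four adds little, which is the kind of drop in marginal gain that the elbow criterion looks for.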
Spectral clustering
Given a set of data points A, the similarity matrix may be defined as a matrix $S$ where $S_{ij}$ represents a measure of the similarity between points $i, j \in A$.
Spectral clustering techniques make use of the spectrum of the similarity matrix of the data to perform dimensionality reduction for clustering in fewer dimensions.
One such technique is the Shi-Malik algorithm, commonly used for image segmentation.
It partitions points into two sets $(S_1, S_2)$ based on the eigenvector $v$ corresponding to the second-smallest eigenvalue of the Laplacian matrix $L = I - D^{-1/2} S D^{-1/2}$
of $S$, where $D$ is the diagonal matrix $D_{ii} = \sum_j S_{ij}$.
This partitioning may be done in various ways, such as by taking the median $m$ of the components in $v$, and placing all points whose component in $v$ is greater than $m$ in $S_1$, and the rest in $S_2$.
The algorithm can be used for hierarchical clustering by repeatedly partitioning the subsets in this fashion.
A related algorithm is the Meila-Shi algorithm, which takes the eigenvectors corresponding to the k largest eigenvalues of the matrix $P = S D^{-1}$ for some k, and then invokes another (e.g. k-means) to cluster points by their respective k components in these eigenvectors.
Applications
Biology
In biology, clustering has many applications:
In imaging, data clustering may take different form based on the data dimensionality.
For example, the SOCR EM Mixture model segmentation activity and applet shows how to obtain point, region or volume classification using the online SOCR computational libraries.
In the fields of plant and animal ecology, clustering is used to describe and to make spatial and temporal comparisons of communities (assemblages) of organisms in heterogeneous environments; it is also used in plant systematics to generate artificial phylogenies or clusters of organisms (individuals) at the species, genus or higher level that share a number of attributes.
In computational biology and bioinformatics:
In transcriptomics, clustering is used to build groups of genes with related expression patterns (also known as coexpressed genes).
Often such groups contain functionally related proteins, such as enzymes for a specific pathway, or genes that are co-regulated.
High throughput experiments using expressed sequence tags (ESTs) or DNA microarrays can be a powerful tool for genome annotation, a general aspect of genomics.
In sequence analysis, clustering is used to group homologous sequences into gene families.
This is a very important concept in bioinformatics, and evolutionary biology in general.
See evolution by gene duplication.
In high-throughput genotyping platforms clustering algorithms are used to automatically assign genotypes.
Medicine
In medical imaging, such as PET scans, cluster analysis can be used to differentiate between different types of tissue and blood in a three dimensional image.
In this application, actual position does not matter, but the voxel intensity is considered as a vector, with a dimension for each image that was taken over time.
This technique allows, for example, accurate measurement of the rate at which a radioactive tracer is delivered to the area of interest, without a separate sampling of arterial blood, an intrusive technique that is most common today.
Market research
Cluster analysis is widely used in market research when working with multivariate data from surveys and test panels.
Market researchers use cluster analysis to partition the general population of consumers into market segments and to better understand the relationships between different groups of consumers/potential customers.
Segmenting the market and determining target markets
Product positioning
New product development
Selecting test markets (see experimental techniques)
Other applications
Social network analysis: In the study of social networks, clustering may be used to recognize communities within large groups of people.
Image segmentation: Clustering can be used to divide a digital image into distinct regions for border detection or object recognition.
Data mining: Many data mining applications involve partitioning data items into related subsets; the marketing applications discussed above represent some examples.
Another common application is the division of documents, such as World Wide Web pages, into genres.
Search result grouping: In the process of intelligent grouping of the files and websites, clustering may be used to create a more relevant set of search results compared to normal search engines like Google.
There are currently a number of web based clustering tools such as Clusty.
Slippy map optimization: Flickr's map of photos and other map sites use clustering to reduce the number of markers on a map.
This both makes the map faster and reduces the amount of visual clutter.
IMRT segmentation: Clustering can be used to divide a fluence map into distinct regions for conversion into deliverable fields in MLC-based Radiation Therapy.
Grouping of Shopping Items: Clustering can be used to group all the shopping items available on the web into a set of unique products.
For example, all the items on eBay can be grouped into unique products.
(eBay doesn't have the concept of a SKU)
Mathematical chemistry: Clustering is used to find structural similarity; for example, 3000 chemical compounds were clustered in the space of 90 topological indices.
Petroleum Geology: Cluster Analysis is used to reconstruct missing bottom hole core data or missing log curves in order to evaluate reservoir properties.
Comparisons between data clusterings
There have been several suggestions for a measure of similarity between two clusterings.
Such a measure can be used to compare how well different data clustering algorithms perform on a set of data.
Many of these measures are derived from the matching matrix (aka confusion matrix), e.g., the Rand measure and the Fowlkes-Mallows Bk measures.
Marina Meila's Variation of Information metric is a more recent approach for measuring distance between clusterings.
It uses mutual information and entropy to approximate the distance between two clusterings across the lattice of possible clusterings.
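As an illustration of a pair-counting comparison, the Rand measure mentioned above can be computed as the fraction of point pairs on which two clusterings agree, that is, pairs placed together in both or apart in both. The sketch below assumes each clustering is given as a list of cluster labels, one per data point; the function name is chosen for illustration.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of point pairs treated the same way by both clusterings."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs)
    return agreements / len(pairs)

# Identical partitions with different label names agree on every pair.
same = rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# Partitions that cut across each other agree on only some pairs.
crossed = rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The measure depends only on which points are grouped together, not on the names given to the clusters, so relabelling a clustering does not change the result.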
Algorithms
In recent years considerable effort has been put into improving algorithm performance (Z. Huang, 1998).
Among the most popular are CLARANS (Ng and Han, 1994), DBSCAN (Ester et al., 1996) and BIRCH (Zhang et al., 1996).
Data mining
Data mining is the process of sorting through large amounts of data and picking out relevant information.
It is usually used by business intelligence organizations, and financial analysts, but is increasingly being used in the sciences to extract information from the enormous data sets generated by modern experimental and observational methods.
It has been described as "the nontrivial extraction of implicit, previously unknown, and potentially useful information from data" and "the science of extracting useful information from large data sets or databases."
Data mining in relation to enterprise resource planning is the statistical and logical analysis of large sets of transaction data, looking for patterns that can aid decision making.
Background
Traditionally, business analysts have performed the task of extracting useful information from recorded data, but the increasing volume of data in modern business and science calls for computer-based approaches.
As data sets have grown in size and complexity, there has been a shift away from direct hands-on data analysis toward indirect, automatic data analysis using more complex and sophisticated tools.
The modern technologies of computers, networks, and sensors have made data collection and organization much easier.
However, the captured data needs to be converted into information and knowledge to become useful.
Data mining is the entire process of applying computer-based methodology, including new techniques for knowledge discovery, to data.
Data mining identifies trends within data that go beyond simple analysis.
Through the use of sophisticated algorithms, non-statistician users have the opportunity to identify key attributes of business processes and target opportunities.
However, abdicating control of this process from the statistician to the machine may result in false-positives or no useful results at all.
Although data mining is a relatively new term, the technology is not.
For many years, businesses have used powerful computers to sift through volumes of data such as supermarket scanner data to produce market research reports (although reporting is not considered to be data mining).
Continuous innovations in computer processing power, disk storage, and statistical software are dramatically increasing the accuracy and usefulness of data analysis.
Web 2.0 technologies have generated a colossal amount of user-generated data and media, making it hard to aggregate and consume information in a meaningful way without getting overloaded.
Given the size of the data on the Internet, and the difficulty in contextualizing it, it is unclear whether the traditional approach to data mining is computationally viable.
The term data mining is often used to apply to the two separate processes of knowledge discovery and prediction.
Knowledge discovery provides explicit information that has a readable form and can be understood by a user.
Forecasting, or predictive modeling provides predictions of future events and may be transparent and readable in some approaches (e.g., rule-based systems) and opaque in others such as neural networks.
Moreover, some data-mining systems such as neural networks are inherently geared towards prediction and pattern recognition, rather than knowledge discovery.
Metadata, or data about a given data set, are often expressed in a condensed data-minable format, or one that facilitates the practice of data mining.
Common examples include executive summaries and scientific abstracts.
Data mining relies on the use of real world data.
This data is extremely vulnerable to collinearity precisely because data from the real world may have unknown interrelations.
An unavoidable weakness of data mining is that the critical data that may expose any relationship might have never been observed.
Alternative approaches using an experiment-based approach such as Choice Modelling for human-generated data may be used.
Inherent correlations are either controlled for or removed altogether through the construction of an experimental design.
Recently, there were some efforts to define a standard for data mining, for example the CRISP-DM standard for analysis processes or the Java Data-Mining Standard.
Independent of these standardization efforts, freely available open-source software systems like RapidMiner and Weka have become an informal standard for defining data-mining processes.
Privacy concerns
There are also privacy and human rights concerns associated with data mining, specifically regarding the source of the data analyzed.
Data mining provides information that may be difficult to obtain otherwise.
When the data collected involves individual people, there are many questions concerning privacy, legality, and ethics.
In particular, data mining government or commercial data sets for national security or law enforcement purposes has raised privacy concerns.
Notable uses of data mining
Combatting Terrorism
Data mining has been cited as the method by which the U.S. Army unit Able Danger had identified the September 11, 2001 attacks leader, Mohamed Atta, and three other 9/11 hijackers as possible members of an Al Qaeda cell operating in the U.S. more than a year before the attack.
It has been suggested that both the Central Intelligence Agency and the Canadian Security Intelligence Service have employed this method.
Previous data mining programs under the US government intended to stop terrorism include the Terrorism Information Awareness (TIA) program, Computer-Assisted Passenger Prescreening System (CAPPS II), Analysis, Dissemination, Visualization, Insight, and Semantic Enhancement (ADVISE), Multistate Anti-Terrorism Information Exchange (MATRIX), and the Secure Flight program.
These programs have been discontinued due to controversy over whether they violate the US Constitution's 4th amendment.
Games
Since the early 1960s, with the availability of oracles for certain combinatorial games, also called tablebases (e.g. for 3x3-chess with any beginning configuration, small-board dots-and-boxes, small-board hex, and certain endgames in chess, dots-and-boxes, and hex), a new area for data mining has been opened up.
This is the extraction of human-usable strategies from these oracles.
Current pattern recognition approaches do not seem to fully have the required high level of abstraction in order to be applied successfully.
Instead, extensive experimentation with the tablebases, combined with an intensive study of tablebase-answers to well designed problems and with knowledge of prior art, i.e. pre-tablebase knowledge, is used to yield insightful patterns.
Berlekamp in dots-and-boxes etc. and John Nunn in chess endgames are notable examples of researchers doing this work, though they were not and are not involved in tablebase generation.
Business
Data mining in customer relationship management applications can contribute significantly to the bottom line.
Rather than contacting a prospect or customer through a call center or sending mail, only prospects that are predicted to have a high likelihood of responding to an offer are contacted.
More sophisticated methods may be used to optimize across campaigns so that we can predict which channel and which offer an individual is most likely to respond to - across all potential offers.
Finally, in cases where many people will take an action without an offer, uplift modeling can be used to determine which people will have the greatest increase in responding if given an offer.
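The idea behind uplift can be illustrated with a small Python sketch. It is only a sketch under stated assumptions: it assumes that past campaign records are available for both people who received an offer and a comparable group who did not, and it estimates uplift per segment as the simple difference in response rates. The function name uplift_by_segment and the record layout are hypothetical, chosen for illustration; real uplift models are usually more elaborate.

```python
from collections import defaultdict


def uplift_by_segment(records):
    """Estimate uplift per segment as a difference in response rates.

    records: iterable of (segment, got_offer, responded) tuples, where
    got_offer and responded are booleans. This layout is an assumption
    made for the example, not a standard format.
    """
    # counts[segment][got_offer] = [number of people, number who responded]
    counts = defaultdict(lambda: {True: [0, 0], False: [0, 0]})
    for segment, got_offer, responded in records:
        cell = counts[segment][bool(got_offer)]
        cell[0] += 1
        cell[1] += int(bool(responded))

    uplift = {}
    for segment, groups in counts.items():
        treated_n, treated_r = groups[True]
        control_n, control_r = groups[False]
        if treated_n == 0 or control_n == 0:
            continue  # cannot compare without both groups
        uplift[segment] = treated_r / treated_n - control_r / control_n
    return uplift


# Hypothetical records: "loyal" customers respond equally often with or
# without an offer, while "new" customers respond only when given one.
records = (
    [("loyal", True, True)] * 3 + [("loyal", True, False)]
    + [("loyal", False, True)] * 3 + [("loyal", False, False)]
    + [("new", True, True)] * 2 + [("new", True, False)] * 2
    + [("new", False, False)] * 4
)
print(uplift_by_segment(records))
```

In this made-up data the "loyal" segment has a high response rate but zero uplift, so under this simple estimate an offer would be better spent on the "new" segment.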
Data clustering can also be used to automatically discover the segments or groups within a customer data set.
Businesses employing data mining can quickly see a return on investment, but they also recognize that the number of predictive models can quickly become very large.
Rather than one model to predict which customers will churn, a business could build a separate model for each region and customer type.
Then, instead of sending an offer to all people who are likely to churn, it may want to send offers only to customers who are likely to take the offer.
Finally, it may also want to determine which customers are going to be profitable over a window of time and send offers only to those.
To maintain this quantity of models, businesses need to manage model versions and move to automated data mining.
Data mining can also be helpful to human-resources departments in identifying the characteristics of their most successful employees.
Information obtained, such as universities attended by highly successful employees, can help HR focus recruiting efforts accordingly.
Additionally, Strategic Enterprise Management applications help a company translate corporate-level goals, such as profit and margin share targets, into operational decisions, such as production plans and workforce levels.
Another example of data mining, often called market basket analysis, relates to its use in retail sales.
If a clothing store records the purchases of customers, a data-mining system could identify those customers who favour silk shirts over cotton ones.
Although some such relationships may be difficult to explain, taking advantage of them is easier.
The example deals with association rules within transaction-based data.
Not all data are transaction-based, and logical or inexact rules may also be present within a database.
In a manufacturing application, an inexact rule may state that 73% of products which have a specific defect or problem will develop a secondary problem within the next six months.
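An inexact rule of this kind is commonly summarized by two numbers: its support (the fraction of all records in which both sides of the rule hold) and its confidence (the fraction of records satisfying the left-hand side that also satisfy the right-hand side). The following Python sketch computes both for a single rule over a list of transactions. The function name rule_stats and the sample baskets are hypothetical and invented for illustration.

```python
def rule_stats(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent.

    transactions: list of collections of items (one per purchase).
    antecedent, consequent: collections of items.
    """
    antecedent, consequent = set(antecedent), set(consequent)
    n_total = len(transactions)
    # Transactions containing every item of the left-hand side.
    n_antecedent = sum(1 for t in transactions if antecedent <= set(t))
    # Transactions containing every item of both sides.
    n_both = sum(
        1 for t in transactions if (antecedent | consequent) <= set(t)
    )
    support = n_both / n_total if n_total else 0.0
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Made-up purchase records for a clothing store.
baskets = [
    {"silk shirt", "tie"},
    {"silk shirt", "tie", "belt"},
    {"cotton shirt"},
    {"silk shirt"},
    {"cotton shirt", "tie"},
]
support, confidence = rule_stats(baskets, {"silk shirt"}, {"tie"})
print(f"support={support:.2f} confidence={confidence:.2f}")
```

A statement such as "73% of products with a specific defect develop a secondary problem" corresponds to a rule whose confidence is 0.73.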
An example of data mining related to an integrated-circuit production line is described in the paper "Mining IC Test Data to Optimize VLSI Testing."
The paper describes the application of data mining and decision analysis to the problem of die-level functional test.
The experiments it reports demonstrate that mining historical die-test data can yield a probabilistic model of patterns of die failure, which is then used to decide in real time which die to test next and when to stop testing.
This system has been shown, based on experiments with historical test data, to have the potential to improve profits on mature IC products.
Science and engineering
In recent years, data mining has been widely used in areas of science and engineering such as bioinformatics, genetics, medicine, education, and electrical power engineering.
In the study of human genetics, an important goal is to understand the mapping relationship between inter-individual variation in human DNA sequences and variability in disease susceptibility.
In lay terms, it is to find out how the changes in an individual's DNA sequence affect the risk of developing common diseases such as cancer.
This is very important to help improve the diagnosis, prevention and treatment of the diseases.
The data mining technique that is used to perform this task is known as multifactor dimensionality reduction.
In the area of electrical power engineering, data mining techniques have been widely used for condition monitoring of high voltage electrical equipment.
The purpose of condition monitoring is to obtain valuable information on the health status of the equipment's insulation.
Data clustering techniques such as the self-organizing map (SOM) have been applied to the vibration monitoring and analysis of transformer on-load tap changers (OLTCs).
Using vibration monitoring, it can be observed that each tap change operation generates a signal that contains information about the condition of the tap changer contacts and the drive mechanisms.
Obviously, different tap positions will generate different signals.
However, there is considerable variability amongst normal-condition signals for the exact same tap position.
SOM has been applied to detect abnormal conditions and to estimate the nature of the abnormalities.
Data mining techniques have also been applied for dissolved gas analysis (DGA) on power transformers.
DGA, as a diagnostic for power transformers, has been available for many years.
Data mining techniques such as SOM have been applied to analyse data and to determine trends that are not obvious to standard DGA ratio techniques such as the Duval Triangle.
A fourth area of application for data mining in science/engineering is within educational research, where data mining has been used to study the factors leading students to choose to engage in behaviors which reduce their learning and to understand the factors influencing university student retention.
Other applications of data mining techniques include biomedical data mining facilitated by domain ontologies, mining of clinical trial data, and traffic analysis using SOM.
Data set
A data set (or dataset) is a collection of data, usually presented in tabular form.
Each column represents a particular variable.
Each row corresponds to a given member of the data set in question.
It lists values for each of the variables, such as height and weight of an object or values of random numbers.
Each value is known as a datum.
The data set may comprise data for one or more members, corresponding to the number of rows.
Historically, the term originated in the mainframe field, where it had a well-defined meaning, very close to that of the contemporary computer file.
This topic is not covered here.
In the simplest case, there is only one variable, and then the data set consists of a single column of values, often represented as a list.
The values may be numbers, such as real numbers or integers, for example representing a person's height in centimeters, but may also be nominal data (i.e., not consisting of numerical values), for example representing a person's ethnicity.
More generally, values may be of any of the kinds described as a level of measurement.
For each variable, the values will normally all be of the same kind.
However, there may also be "missing values", which need to be indicated in some way.
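The tabular structure described above can be sketched in Python as a list of rows, each mapping variable names to values. This is only one possible representation: the variable names height_cm and weight_kg, the sample values, and the choice of None as the marker for a missing value are all assumptions made for the example.

```python
# Each row is one member of the data set; each key is one variable.
# None marks a missing value (an illustrative convention, not a standard).
data_set = [
    {"height_cm": 172.0, "weight_kg": 68.5},
    {"height_cm": 181.5, "weight_kg": None},
    {"height_cm": 165.0, "weight_kg": 59.0},
]


def column(rows, variable):
    """Return all values of one variable, i.e. one column of the table."""
    return [row[variable] for row in rows]


def mean_ignoring_missing(values):
    """Average the values that are present; return None if there are none."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None


weights = column(data_set, "weight_kg")
print(weights)
print(mean_ignoring_missing(weights))
```

Marking missing values explicitly, rather than substituting a number such as 0, keeps them from silently distorting summaries like the mean.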
In statistics data sets usually come from actual observations obtained by sampling a statistical population, and each row corresponds to the observations on one element of that population.
Data sets may further be generated by algorithms for the purpose of testing certain kinds of software.
Some modern statistical analysis software, such as PSPP, still presents its data in the classical dataset fashion.
Classic data sets
Several classic data sets have been used extensively in the statistical literature:
Iris flower data set - multivariate data set introduced by Ronald Fisher (1936).
Categorical data analysis - Data sets used in the book, An Introduction to Categorical Data Analysis, by Agresti are provided on-line by StatLib.
Robust statistics - Data sets used in Robust Regression and Outlier Detection (Rousseeuw and Leroy, 1986). Provided on-line at the University of Cologne.
Time series - Data used in Chatfield's book, The Analysis of Time Series, are provided on-line by StatLib.
Extreme values - Data used in the book, An Introduction to the Statistical Modeling of Extreme Values, are provided on-line by Stuart Coles, the book's author.
Bayesian Data Analysis - Data used in the book, Bayesian Data Analysis, are provided on-line by Andrew Gelman, one of the book's authors.
The Bupa liver data, used in several papers in the machine learning (data mining) literature.
ELIZA
ELIZA is a computer program by Joseph Weizenbaum, designed in 1966, which parodied a Rogerian therapist, largely by rephrasing many of the patient's statements as questions and posing them to the patient.
Thus, for example, the response to "My head hurts" might be "Why do you say your head hurts?"
The response to "My mother hates me" might be "Who else in your family hates you?"
ELIZA was named after Eliza Doolittle, a working-class character in George Bernard Shaw's play Pygmalion, who is taught to speak with an upper class accent.
Overview
It is sometimes inaccurately said that ELIZA simulates a therapist.
Weizenbaum said that ELIZA provided a "parody" of "the responses of a non-directional psychotherapist in an initial psychiatric interview."
He chose the context of psychotherapy to "sidestep the problem of giving the program a data base of real-world knowledge", the therapeutic situation being one of the few real human situations in which a human being can reply to a statement with a question that indicates very little specific knowledge of the topic under discussion.
For example, it is a context in which the question "Who is your favorite composer?" can be answered acceptably with responses such as "What about your own favorite composer?" or "Does that question interest you?"
First implemented in Weizenbaum's own SLIP list-processing language, ELIZA worked by simple parsing and substitution of key words into canned phrases.
Depending upon the initial entries by the user the illusion of a human writer could be instantly dispelled, or could continue through several interchanges.
It was sometimes so convincing that there are many anecdotes about people becoming very emotionally caught up in dealing with ELIZA for several minutes until the machine's true lack of understanding became apparent.
This was likely due to people's tendency to attach meanings to words which the computer never put there.
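The keyword-and-substitution approach can be illustrated with a short Python sketch. It is not Weizenbaum's SLIP implementation or his script: the rules, the pronoun table REFLECTIONS, the function names, and the fallback reply are all hypothetical, chosen only so that the two example exchanges quoted earlier are reproduced.

```python
import re

# Words swapped when echoing the user's statement back (illustrative only).
REFLECTIONS = {
    "my": "your", "me": "you", "i": "you",
    "am": "are", "your": "my", "you": "me",
}

# (pattern, canned phrase) pairs, tried in order; {0}, {1} are filled
# with the reflected text captured by the pattern's groups.
RULES = [
    (re.compile(r"\bmy (mother|father|sister|brother) (.*)", re.I),
     "Who else in your family {1}?"),
    (re.compile(r"\bmy (.*)", re.I),
     "Why do you say your {0}?"),
]

DEFAULT_REPLY = "Please go on."  # assumed fallback when no keyword matches


def reflect(fragment):
    """Swap first- and second-person words in a captured fragment."""
    return " ".join(REFLECTIONS.get(w.lower(), w) for w in fragment.split())


def respond(statement):
    """Return a canned phrase built from the first matching rule."""
    text = statement.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.search(text)
        if match:
            return template.format(*(reflect(g) for g in match.groups()))
    return DEFAULT_REPLY


print(respond("My head hurts"))
print(respond("My mother hates me"))
```

Nothing in the sketch models the meaning of the input; any impression of understanding comes from the user's own words being reflected back.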
In 1966, interactive computing (via a teletype) was new.
It was 15 years before the personal computer became familiar to the general public, and two decades before most people encountered attempts at natural language processing in Internet services like Ask.com or PC help systems such as Microsoft Office Clippy.
Although those programs included years of research and work (while Ecala eclipsed the functionality of ELIZA after less than two weeks of work by a single programmer), ELIZA remains a milestone simply because it was the first time a programmer had attempted such a human-machine interaction with the goal of creating the illusion (however brief) of human-human interaction.
In an excerpt from Computer Power and Human Reason (1976), reprinted in The New Media Reader, edited by Noah Wardrip-Fruin and Nick Montfort, Weizenbaum describes how quickly and deeply people became emotionally involved with the computer program: they took offence when he asked to view the transcripts, saying it was an invasion of their privacy, and even asked him to leave the room while they were working with ELIZA.
Influence on games
ELIZA influenced a number of early computer games by demonstrating additional kinds of interface designs.
Don Daglow wrote an enhanced version of the program called Ecala on a PDP-10 mainframe computer at Pomona College in 1973 before writing what was possibly the second or third computer role-playing game, Dungeon (1975). (The first was probably "dnd", written on and for the PLATO system in 1974, and the second may have been Moria, written in 1975.)
It is likely that ELIZA was also on the system where Will Crowther created Adventure, the 1975 game that spawned the interactive fiction genre.
But both these games appeared some nine years after the original ELIZA.
Response and legacy
Lay responses to ELIZA disturbed Weizenbaum and motivated him to write his book Computer Power and Human Reason: From Judgment to Calculation, in which he explains the limits of computers and argues that anthropomorphic views of computers diminish the human being, and indeed any life form.
There are many programs based on ELIZA in different languages in addition to Ecala.
For example, in 1980, a company called "Don't Ask Software", founded by Randy Simon, created a version for the Apple II, Atari, and Commodore PCs, which verbally abused the user based on the user's input.
In Spain, Jordi Perez developed the famous ZEBAL in 1993, written in Clipper for MS-DOS.
Other versions adapted ELIZA around a religious theme, such as ones featuring Jesus (both serious and comedic) and another Apple II variant called I Am Buddha.
The 1980 game The Prisoner incorporated ELIZA-style interaction within its gameplay.
ELIZA has also inspired a podcast called "The Eliza Podcast", in which the host engages in self-analysis using a computer generated voice prompting with questions in the same style as the ELIZA program.
Implementations
Using JavaScript: http://www.manifestation.com/neurotoys/eliza.php3
Source code in Java: http://chayden.net/eliza/Eliza.html
Another Java-implementation of ELIZA: http://www.wedesoft.demon.co.uk/eliza/
Using C on the TI-89: http://kaikostack.com/ti89_en.htm#eliza
Using z80 Assembly on the TI-83 Plus: http://www.ticalc.org/archives/files/fileinfo/354/35463.html
A perl module Chatbot::Eliza — example implementation
Trans-Tex Software has released shareware versions for Classic Mac OS and Mac OS X: http://www.tex-edit.com/index.html#Eliza
doctor.el (circa 1985) in Emacs.
Source code in Tcl: http://wiki.tcl.tk/9235
The Indy Delphi oriented TCP/IP components suite has an Eliza implementation as demo.
Pop-11 Eliza in the Poplog system, which goes back to about 1976, when it was used for teaching AI at Sussex University, and is now part of the free open source Poplog system.
Source code in BASIC: http://www.atariarchives.org/bigcomputergames/showpage.php?page=22
ECC-Eliza for Windows (actual program is for DOS, but unpacker is for Windows) (rename .txt to .exe before running): http://www5.domaindlx.com/ecceliza1/ecceliza.txt.
More recent version at http://web.archive.org/web/20041117123025/http://www5.domaindlx.com/ecceliza1/ecceliza.txt.
English language
English is an Indo-European, West Germanic language originating in England, and is the first language for most people in the United Kingdom, the United States, Canada, Australia, New Zealand, Ireland, and the Anglophone Caribbean.
It is used extensively as a second language and as an official language throughout the world, especially in Commonwealth countries and in many international organizations.
Significance
Modern English, sometimes described as the first global lingua franca, is the dominant international language in communications, science, business, aviation, entertainment, radio and diplomacy.
The initial reason for its enormous spread beyond the bounds of the British Isles where it was originally a native tongue was the British Empire, and by the late nineteenth century its influence had won a truly global reach.
It is the dominant language in the United States and the growing economic and cultural influence of that federal union as a global superpower since World War II has significantly accelerated adoption of English as a language across the planet.
A working knowledge of English has become a requirement in a number of fields, occupations and professions, such as medicine, and as a consequence over a billion people speak English to at least a basic level (see English language learning and teaching).
Linguists such as David Crystal recognize that one impact of this massive growth of English, in common with other global languages, has been to reduce native linguistic diversity in many parts of the world historically, most particularly in Australasia and North America, and its huge influence continues to play an important role in language attrition.
By a similar token, historical linguists, aware of the complex and fluid dynamics of language change, are always alive to the potential English contains through the vast size and spread of the communities that use it and its natural internal variety, such as in its creoles and pidgins, to produce a new family of distinct languages over time.
English is one of six official languages of the United Nations.
History
English is a West Germanic language that originated from the Anglo-Frisian dialects brought to Britain by Germanic settlers and Roman auxiliary troops from various parts of what is now northwest Germany and the Northern Netherlands.
Initially, Old English was a diverse group of dialects, reflecting the varied origins of the Anglo-Saxon Kingdoms of England.
One of these dialects, Late West Saxon, eventually came to dominate.
The original Old English language was then influenced by two waves of invasion.
The first was by speakers of languages of the Scandinavian branch of the Germanic family; they conquered and colonized parts of Britain in the 8th and 9th centuries.
The second was the Normans in the 11th century, who spoke Old Norman and ultimately developed an English variety of this called Anglo-Norman.
These two invasions caused English to become "mixed" to some degree (though it was never a truly mixed language in the strict linguistic sense of the word; mixed languages arise from the cohabitation of speakers of different languages, who develop a hybrid tongue for basic communication).
Cohabitation with the Scandinavians resulted in a significant grammatical simplification and lexical supplementation of the Anglo-Frisian core of English; the later Norman occupation led to the grafting onto that Germanic core of a more elaborate layer of words from the Italic branch of the European languages.
This Norman influence entered English largely through the courts and government.
Thus, English developed into a "borrowing" language of great flexibility and with a huge vocabulary.
Classification and related languages
The English language belongs to the western sub-branch of the Germanic branch of the Indo-European family of languages.
The closest living relative of English is Scots, spoken primarily in Scotland and parts of Northern Ireland, which is viewed by linguists as either a separate language or a group of dialects of English.
The next closest relative to English after Scots is Frisian, spoken in the Northern Netherlands and Northwest Germany.
Other less closely related living West Germanic languages include Dutch, Low German, German and Afrikaans.
The North Germanic languages of Scandinavia are less closely related to English than the West Germanic languages.
Many French words are also intelligible to an English speaker (though pronunciations are often quite different) because English absorbed a large vocabulary from Norman and French, via Anglo-Norman after the Norman Conquest and directly from French in subsequent centuries.
As a result, a large portion of English vocabulary is derived from French, with some minor spelling differences (word endings, use of old French spellings, etc.), as well as occasional divergences in meaning, in so-called "faux amis", or false friends.
The pronunciation of French loanwords in English has become completely anglicized and follows a typically Germanic pattern of stress.
Geographical distribution
Approximately 375 million people speak English as their first language.
English today is probably the third largest language by number of native speakers, after Mandarin Chinese and Spanish.
However, when combining native and non-native speakers it is probably the most commonly spoken language in the world, though possibly second to a combination of the Chinese languages, depending on whether or not distinctions in the latter are classified as "languages" or "dialects."
Estimates that include second language speakers vary greatly from 470 million to over a billion depending on how literacy or mastery is defined.
There are some who claim that non-native speakers now outnumber native speakers by a ratio of 3 to 1.
The countries with the highest populations of native English speakers are, in descending order: United States (215 million), United Kingdom (58 million), Canada (18.2 million), Australia (15.5 million), Ireland (3.8 million), South Africa (3.7 million), and New Zealand (3.0-3.7 million).
Countries such as Jamaica and Nigeria also have millions of native speakers of dialect continua ranging from an English-based creole to a more standard version of English.
Of those nations where English is spoken as a second language, India has the most such speakers ('Indian English') and linguistics professor David Crystal claims that, combining native and non-native speakers, India now has more people who speak or understand English than any other country in the world.
Following India is the People's Republic of China.
Countries in order of total speakers
English is the primary language in Anguilla, Antigua and Barbuda, Australia (Australian English), the Bahamas, Barbados, Bermuda, Belize (Belizean Kriol), the British Indian Ocean Territory, the British Virgin Islands, Canada (Canadian English), the Cayman Islands, the Falkland Islands, Gibraltar, Grenada, Guam, Guernsey (Channel Island English), Guyana, Ireland (Hiberno-English), Isle of Man (Manx English), Jamaica (Jamaican English), Jersey, Montserrat, Nauru, New Zealand (New Zealand English), Pitcairn Islands, Saint Helena, Saint Kitts and Nevis, Saint Vincent and the Grenadines, Singapore, South Georgia and the South Sandwich Islands, Trinidad and Tobago, the Turks and Caicos Islands, the United Kingdom, the U.S. Virgin Islands, and the United States.
In many other countries, where English is not the most spoken language, it is an official language; these countries include Botswana, Cameroon, Dominica, Fiji, the Federated States of Micronesia, Ghana, Gambia, India, Kenya, Kiribati, Lesotho, Liberia, Madagascar, Malta, the Marshall Islands, Mauritius, Namibia, Nigeria, Pakistan, Palau, Papua New Guinea, the Philippines, Puerto Rico, Rwanda, the Solomon Islands, Saint Lucia, Samoa, Seychelles, Sierra Leone, Sri Lanka, Swaziland, Tanzania, Uganda, Zambia, and Zimbabwe.
It is also one of the 11 official languages that are given equal status in South Africa (South African English).
English is also the official language in current dependent territories of Australia (Norfolk Island, Christmas Island and Cocos Island) and of the United States (Northern Mariana Islands, American Samoa and Puerto Rico), and in the former British colony of Hong Kong.
English is an important language in several former colonies and protectorates of the United Kingdom but falls short of official status, such as in Malaysia, Brunei, United Arab Emirates and Bahrain.
English is also not an official language in either the United States or the United Kingdom.
Although the United States federal government has no official languages, English has been given official status by 30 of the 50 state governments.
English as a global language
Because English is so widely spoken, it has often been referred to as a "world language", the lingua franca of the modern era.
While English is not an official language in most countries, it is currently the language most often taught as a second language around the world.
Some linguists believe that it is no longer the exclusive cultural sign of "native English speakers", but is rather a language that is absorbing aspects of cultures worldwide as it continues to grow.
It is, by international treaty, the official language for aerial and maritime communications.
English is an official language of the United Nations and many other international organizations, including the International Olympic Committee.
English is the language most often studied as a foreign language in the European Union (by 89% of schoolchildren), followed by French (32%), German (18%), and Spanish (8%).
In the EU, a large fraction of the population reports being able to converse to some extent in English.
Among non-English speaking countries, a large percentage of the population claimed to be able to converse in English in the Netherlands (87%), Sweden (85%), Denmark (83%), Luxembourg (66%), Finland (60%), Slovenia (56%), Austria (53%), Belgium (52%), and Germany (51%).
Norway and Iceland also have a large majority of competent English-speakers.
Books, magazines, and newspapers written in English are available in many countries around the world.
English is also the most commonly used language in the sciences.
In 1997, the Science Citation Index reported that 95% of its articles were written in English, even though only half of them came from authors in English-speaking countries.
Dialects and regional varieties
The expansion of the British Empire and—since WWII—the primacy of the United States have spread English throughout the globe.
Because of that global spread, English has developed a host of English dialects and English-based creole languages and pidgins.
The major varieties of English include, in most cases, several subvarieties, such as Cockney within British English; Newfoundland English within Canadian English; and African American Vernacular English ("Ebonics") and Southern American English within American English.
English is a pluricentric language, without a central language authority like France's Académie française; and, although no variety is clearly considered the only standard, there are a number of accents considered to be more prestigious, such as Received Pronunciation in Britain.
Scots developed—largely independently—from the same origins, but following the Acts of Union 1707 a process of language attrition began, whereby successive generations adopted more and more features from English causing dialectalisation.
Whether it is now a separate language or a dialect of English better described as Scottish English is in dispute.
The pronunciation, grammar and lexis of the traditional forms differ, sometimes substantially, from other varieties of English.
Because of the wide use of English as a second language, English speakers have many different accents, which often signal the speaker's native dialect or language.
For the more distinctive characteristics of regional accents, see Regional accents of English, and for the more distinctive characteristics of regional dialects, see List of dialects of the English language.
Just as English itself has borrowed words from many different languages over its history, English loanwords now appear in a great many languages around the world, indicative of the technological and cultural influence of its speakers.
Several pidgins and creole languages have formed using an English base, such as Jamaican Patois, Nigerian Pidgin, and Tok Pisin.
There are many words in English coined to describe forms of particular non-English languages that contain a very high proportion of English words.
Franglais, for example, is used to describe French with a very high English word content; it is found on the Channel Islands.
Another variant, spoken in the border bilingual regions of Québec in Canada, is called Frenglish.
In Wales, which is part of the United Kingdom, the languages of Welsh and English are sometimes mixed together by fluent or comfortable Welsh speakers, the result of which is called Wenglish.
Constructed varieties of English
Basic English is simplified for easy international use.
It is used by manufacturers and other international businesses to write manuals and communicate.
Some English schools in Asia teach it as a practical subset of English for use by beginners.
Special English is a simplified version of English used by the Voice of America.
It uses a vocabulary of only 1500 words.
English reform is an attempt to improve collectively upon the English language.
Seaspeak and the related Airspeak and Policespeak, all based on restricted vocabularies, were designed by Edward Johnson in the 1980s to aid international cooperation and communication in specific areas.
There is also a tunnelspeak for use in the Channel Tunnel.
Euro-English is a concept of standardising English for use as a second language in continental Europe.
Manually Coded English — a variety of systems have been developed to represent the English language with hand signals, designed primarily for use in deaf education.
These should not be confused with true sign languages such as British Sign Language and American Sign Language used in Anglophone countries, which are independent and not based on English.
E-Prime excludes forms of the verb to be.
Euro-English (also EuroEnglish) terms are English translations of European concepts that are not native to English-speaking countries.
Because of the United Kingdom's (and even the Republic of Ireland's) involvement in the European Union, the usage focuses on non-British concepts.
This kind of Euro-English was parodied when English was "made" one of the constituent languages of Europanto.
Phonology
Vowels
Notes:
It is the vowels that differ most from region to region.
Where symbols appear in pairs, the first corresponds to American English, General American accent; the second corresponds to British English, Received Pronunciation.
American English lacks this sound; words with this sound are pronounced with /ɑ/ or /ɔ/.
See Lot-cloth split.
Some dialects of North American English do not have this vowel.
See Cot-caught merger.
The North American variation of this sound is a rhotic vowel.
Many speakers of North American English do not distinguish between these two unstressed vowels.
For them, roses and Rosa's are pronounced the same, and the symbol usually used is schwa /ə/.
This sound is often transcribed with or with .
The diphthongs /eɪ/ and /oʊ/ are monophthongal for many General American speakers, as [eː] and [oː].
The letter can represent either or the iotated vowel .
In BRP, if this iotated vowel occurs after , , or , it often triggers palatalization of the preceding consonant, turning it to , , and respectively, as in tune, during, sugar, and azure.
In American English, palatalization does not generally happen unless the is followed by r, with the result that turn to , , and respectively, as in nature, verdure, sure, and treasure.
Vowel length plays a phonetic role in the majority of English dialects, and is said to be phonemic in a few dialects, such as Australian English and New Zealand English.
In certain dialects of the modern English language, for instance General American, there is allophonic vowel length: vowel phonemes are realized as long vowel allophones before voiced consonant phonemes in the coda of a syllable.
Before the Great Vowel Shift, vowel length was phonemically contrastive.
This sound only occurs in non-rhotic accents.
In some accents, this sound may be, instead of , .
See English-language vowel changes before historic r.
This sound only occurs in non-rhotic accents.
In some accents, the schwa offglide of may be dropped, monophthising and lengthening the sound to .
See also IPA chart for English dialects for more vowel charts.
Consonants
This is the English consonantal system using symbols from the International Phonetic Alphabet (IPA).
The velar nasal is a non-phonemic allophone of /n/ in some northerly British accents, appearing only before /k/ and /g/.
In all other dialects it is a separate phoneme, although it only occurs in syllable codas.
The alveolar tap is an allophone of /t/ and /d/ in unstressed syllables in North American English and Australian English.
This is the sound of tt or dd in the words latter and ladder, which are homophones for many speakers of North American English.
In some accents such as Scottish English and Indian English it replaces .
This is the same sound represented by single r in most varieties of Spanish.
In some dialects, such as Cockney, the interdentals /θ/ and /ð/ are usually merged with /f/ and /v/, and in others, like African American Vernacular English, /ð/ is merged with dental /d/.
In some Irish varieties, /θ/ and /ð/ become the corresponding dental plosives, which then contrast with the usual alveolar plosives.
The sounds are labialised in some dialects.
Labialisation is never contrastive in initial position and therefore is sometimes not transcribed.
Most speakers of General American realize (always rhoticized) as the retroflex approximant , whereas the same is realized in Scottish English, etc. as the alveolar trill.
The voiceless palatal fricative /ç/ is in most accents just an allophone of /h/ before /j/; for instance human /çjuːmən/.
However, in some accents, the /j/ is dropped, but the initial consonant is the same.
The voiceless velar fricative /x/ is used by Scottish or Welsh speakers of English for Scots/Gaelic words such as loch or by some speakers for loanwords from German and Hebrew like Bach or Chanukah /xanuka/. /x/ is also used in South African English.
In some dialects such as Scouse (Liverpool) either or the affricate may be used as an allophone of /k/ in words such as docker .
Most native speakers of English have great difficulty pronouncing /x/ correctly when learning a foreign language.
Most speakers use the sounds [k] and [h] instead.
Voiceless w is found in Scottish and Irish English, as well as in some varieties of American, New Zealand, and English English.
In most other dialects it is merged with /w/; in some dialects of Scots it is merged with /f/.
Voicing and aspiration
Voicing and aspiration of stop consonants in English depend on dialect and context, but a few general rules can be given:
Voiceless plosives and affricates (/p/, /t/, /k/, and /tʃ/) are aspirated when they are word-initial or begin a stressed syllable: compare pin and spin, crap and scrap.
In some dialects, aspiration extends to unstressed syllables as well.
In other dialects, such as Indian English, all voiceless stops remain unaspirated.
Word-initial voiced plosives may be devoiced in some dialects.
Word-terminal voiceless plosives may be unreleased or accompanied by a glottal stop in some dialects (e.g. many varieties of American English), as in tap and sack.
Word-terminal voiced plosives may be devoiced in some dialects (e.g. some varieties of American English), as in sad and bag.
In other dialects they are fully voiced in final position, but only partially voiced in initial position.
Supra-segmental features
Tone groups
English is an intonation language. This means that the pitch of the voice is used syntactically, for example, to convey surprise and irony, or to change a statement into a question.
In English, intonation patterns are on groups of words, which are called tone groups, tone units, intonation groups or sense groups.
Tone groups are said on a single breath and, as a consequence, are of limited length, averaging about five words or roughly two seconds.
For example:
- Do you need anything?
- I don't, no
- I don't know (contracted to, for example, I dunno in fast or colloquial speech that de-emphasises the pause between don't and know even further)
Characteristics of intonation
English is a strongly stressed language, in that certain syllables, both within words and within phrases, get a relative prominence/loudness during pronunciation while the others do not.
The former kind of syllables are said to be accentuated/stressed and the latter are unaccentuated/unstressed.
Dictionaries of English generally mark the stressed syllable(s) by placing an apostrophe-like sign (ˈ) either before the syllable where the stress falls (as in IPA, the Oxford English Dictionary, or Merriam-Webster dictionaries) or after it (as in many other dictionaries).
Hence in a sentence, each tone group can be subdivided into syllables, which can either be stressed (strong) or unstressed (weak).
The most strongly stressed syllable in a tone group is called the nuclear syllable.
For example:
That | was | the | best | thing | you | could | have | done!
Here, all syllables are unstressed, except the syllables/words best and done, which are stressed.
Best is stressed harder and, therefore, is the nuclear syllable.
The nuclear syllable carries the main point the speaker wishes to make.
For example:
John had not stolen that money. (Stress on "John": ... Someone else had.)
John had not stolen that money. (Stress on "had not": ... Someone said he had. or ... Not at that time, but later he did.)
John had not stolen that money. (Stress on "stolen": ... He acquired the money by some other means.)
John had not stolen that money. (Stress on "that": ... He had stolen some other money.)
John had not stolen that money. (Stress on "money": ... He had stolen something else.)
Also
I did not tell her that. (Stress on "I": ... Someone else told her.)
I did not tell her that. (Stress on "did not": ... You said I did. or ... but now I will.)
I did not tell her that. (Stress on "tell": ... I did not say it; she could have inferred it, etc.)
I did not tell her that. (Stress on "her": ... I told someone else.)
I did not tell her that. (Stress on "that": ... I told her something else.)
This can also be used to express emotion:
Oh really? (... I did not know that.)
Oh really? (... I disbelieve you. or ... That's blatantly obvious.)
The nuclear syllable is spoken more loudly than the others and has a characteristic change of pitch.
The changes of pitch most commonly encountered in English are the rising pitch and the falling pitch, although the fall-rising pitch and/or the rise-falling pitch are sometimes used.
In this opposition between falling and rising pitch, which plays a larger role in English than in most other languages, falling pitch conveys certainty and rising pitch uncertainty.
This can have a crucial impact on meaning, specifically in relation to polarity, the positive–negative opposition; thus, falling pitch means "polarity known", while rising pitch means "polarity unknown".
This underlies the rising pitch of yes/no questions.
For example:
When do you want to be paid?
Now? (Rising pitch. In this case, it denotes a question: "Can I be paid now?" or "Do you desire to pay now?")
Now. (Falling pitch. In this case, it denotes a statement: "I choose to be paid now.")
Grammar
English grammar has minimal inflection compared with most other Indo-European languages.
For example, Modern English, unlike Modern German or Dutch and the Romance languages, lacks grammatical gender and adjectival agreement.
Case marking has almost disappeared from the language and mainly survives in pronouns.
The patterning of strong (e.g. speak/spoke/spoken) versus weak verbs inherited from its Germanic origins has declined in importance in modern English, and the remnants of inflection (such as plural marking) have become more regular.
At the same time, the language has become more analytic, and has developed features such as modal verbs and word order as resources for conveying meaning.
Auxiliary verbs mark constructions such as questions, negative polarity, the passive voice and progressive aspect.
Vocabulary
The English vocabulary has changed considerably over the centuries.
As in many languages deriving from Proto-Indo-European (PIE), many of the most common words in English can be traced back (through the Germanic branch) to PIE.
Such words include the basic pronouns I, from Old English ic, (cf. Latin ego, Greek ego, Sanskrit aham), me (cf. Latin me, Greek eme, Sanskrit mam), numbers (e.g. one, two, three, cf. Latin unus, duo, tres, Greek oinos "ace (on dice)", duo, treis), common family relationships such as mother, father, brother, sister etc. (cf. Greek "meter", Latin "mater", Sanskrit "matṛ"; mother), names of many animals (cf. Sanskrit mus, Greek mys, Latin mus; mouse), and many common verbs (cf. Greek gignōmi, Latin gnoscere, Hittite kanes; to know).
Germanic words (generally words of Old English or to a lesser extent Norse origin) tend to be shorter than the Latinate words of English, and more common in ordinary speech.
This includes nearly all the basic pronouns, prepositions, conjunctions, modal verbs etc. that form the basis of English syntax and grammar.
The longer Latinate words are often regarded as more elegant or educated.
However, the excessive use of Latinate words is considered at times to be either pretentious or an attempt to obfuscate an issue.
George Orwell's essay "Politics and the English Language" is critical of this, as well as other perceived misuse of the language.
An English speaker is in many cases able to choose between Germanic and Latinate synonyms: come or arrive; sight or vision; freedom or liberty.
In some cases there is a choice between a Germanic derived word (oversee), a Latin derived word (supervise), and a French word derived from the same Latin word (survey).
Such synonyms harbor a variety of different meanings and nuances, enabling the speaker to express fine variations or shades of thought.
Familiarity with the etymology of groups of synonyms can give English speakers greater control over their linguistic register.
See: List of Germanic and Latinate equivalents in English.
An exception to this and a peculiarity perhaps unique to English is that the nouns for meats are commonly different from, and unrelated to, those for the animals from which they are produced, the animal commonly having a Germanic name and the meat having a French-derived one.
Examples include: deer and venison; cow and beef; swine/pig and pork; and sheep and mutton.
This is assumed to be a consequence of the Norman invasion, after which a French-speaking elite consumed the meat produced by the Anglo-Saxon lower classes.
Since the majority of words used in informal settings will normally be Germanic, such words are often the preferred choices when a speaker wishes to make a point in an argument in a very direct way.
A majority of Latinate words (or at least a majority of content words) will normally be used in more formal speech and writing, such as a courtroom or an encyclopedia article.
However, there are other Latinate words that are used normally in everyday speech and do not sound formal; these are mainly words for concepts that no longer have Germanic words, and are generally assimilated better and in many cases do not appear Latinate.
For instance, the words mountain, valley, river, aunt, uncle, move, use, push and stay are all Latinate.
English easily accepts technical terms into common usage and often imports new words and phrases.
Examples of this phenomenon include: cookie, Internet and URL (technical terms), as well as genre, über, lingua franca and amigo (imported words/phrases from French, German, modern Latin, and Spanish, respectively).
In addition, slang often provides new meanings for old words and phrases.
In fact, this fluidity is so pronounced that a distinction often needs to be made between formal forms of English and contemporary usage.
See also: sociolinguistics.
Number of words in English
The General Explanations at the beginning of the Oxford English Dictionary states:
The vocabulary of English is undoubtedly vast, but assigning a specific number to its size is more a matter of definition than of calculation.
Unlike languages such as French, German, Spanish and Italian, English has no Academy to define officially accepted words and spellings.
Neologisms are coined regularly in medicine, science and technology and other fields, and new slang is constantly developed.
Some of these new words enter wide usage; others remain restricted to small circles.
Foreign words used in immigrant communities often make their way into wider English usage.
Archaic, dialectal, and regional words might or might not be widely considered as "English".
The Oxford English Dictionary, 2nd edition (OED2) includes over 600,000 definitions, following a rather inclusive policy.
The editors of Webster's Third New International Dictionary, Unabridged (475,000 main headwords) in their preface, estimate the number to be much higher.
It is estimated that about 25,000 words are added to the language each year.
Word origins
One of the consequences of the French influence is that the vocabulary of English is, to a certain extent, divided between those words which are Germanic (mostly West Germanic, with a smaller influence from the North Germanic branch) and those which are "Latinate" (Latin-derived, either directly or from Norman French or other Romance languages).
Numerous sets of statistics have been proposed to demonstrate the origins of English vocabulary.
None, as yet, is considered definitive by most linguists.
A computerised survey of about 80,000 words in the old Shorter Oxford Dictionary (3rd ed.) was published in Ordered Profusion by Thomas Finkenstaedt and Dieter Wolff (1973) that estimated the origin of English words as follows:
Langue d'oïl, including French and Old Norman: 28.3%
Latin, including modern scientific and technical Latin: 28.24%
Other Germanic languages (including words directly inherited from Old English): 25%
Greek: 5.32%
No etymology given: 4.03%
Derived from proper names: 3.28%
All other languages contributed less than 1%
A survey by Joseph M. Williams in Origins of the English Language of 10,000 words taken from several thousand business letters gave this set of statistics:
French (langue d'oïl): 41%
"Native" English: 33%
Latin: 15%
Danish: 2%
Dutch: 1%
Other: 10%
However, 83% of the 1,000 most-common, and all of the 100 most-common English words are Germanic.
Dutch origins
Words describing the navy, types of ships, and other objects or activities on the water are often of Dutch origin.
Yacht (jacht) and cruiser (kruiser) are examples.
French origins
There are many words of French origin in English, such as competition, art, table, publicity, police, role, routine, machine, force, and many others that have been and are being anglicised; they are now pronounced according to English rules of phonology, rather than French.
A large portion of English vocabulary is of French or Langues d'oïl origin, most derived from, or transmitted via, the Anglo-Norman spoken by the upper classes in England for several hundred years after the Norman conquest of England.
Writing system
English has been written using the Latin alphabet since around the ninth century.
(Before that, Old English had been written using Anglo-Saxon runes.)
The spelling system, or orthography, is multilayered, with elements of French, Latin and Greek spelling on top of the native Germanic system; it has grown to vary significantly from the phonology of the language.
The spelling of words often diverges considerably from how they are spoken.
Though letters and sounds may not correspond in isolation, spelling rules that take into account syllable structure, phonetics, and accents are 75% or more reliable.
Some phonics spelling advocates claim that English is more than 80% phonetic.
In general, the English language, being the product of many other languages and having only been codified orthographically in the 16th century, has fewer consistent relationships between sounds and letters than many other languages.
The consequence of this orthographic history is that reading can be challenging.
It takes longer for students to become completely fluent readers of English than of many other languages, including French, Greek, and Spanish.
Basic sound-letter correspondence
Only the consonant letters are pronounced in a relatively regular way.
Written accents
Unlike most other Germanic languages, English has almost no diacritics except in foreign loanwords (like the acute accent in café), and in the uncommon use of a diaeresis mark (often in formal writing) to indicate that two vowels are pronounced separately, rather than as one sound (e.g. naïve, Zoë).
It is almost always acceptable to leave out the marks, especially in digital communications where the QWERTY keyboard lacks any marked letters, but it depends on the context where the word is used.
Some English words retain the diacritic to distinguish them from others, such as animé, exposé, lamé, öre, øre, pâté, piqué, and rosé, though these are sometimes also dropped (résumé/resumé is usually spelled resume in the United States).
There are loan words which occasionally use a diacritic to represent their pronunciation that is not in the original word, such as maté, from Spanish yerba mate, following the French usage, but they are extremely rare.
Formal written English
A version of the language almost universally agreed upon by educated English speakers around the world is called formal written English.
It takes virtually the same form no matter where in the English-speaking world it is written.
In spoken English, by contrast, there are a vast number of differences between dialects, accents, and varieties of slang, colloquial and regional expressions.
In spite of this, local variations in the formal written version of the language are quite limited, being restricted largely to the spelling differences between British and American English.
Basic and simplified versions
To make English easier to read, there are some simplified versions of the language.
One basic version is named Basic English, a constructed language with a small number of words created by Charles Kay Ogden and described in his book Basic English: A General Introduction with Rules and Grammar (1930).
The language is based on a simplified version of English.
Ogden said that it would take seven years to learn English, seven months for Esperanto, and seven weeks for Basic English, comparable with Ido.
Thus Basic English is used by companies who need to make complex books for international use, and by language schools that need to give people some knowledge of English in a short time.
Ogden did not put any words into Basic English that could be said with a few other words and he worked to make the words work for speakers of any other language.
He put his set of words through a large number of tests and adjustments.
He also made the grammar simpler, but tried to keep the grammar normal for English users.
The concept gained its greatest publicity just after the Second World War as a tool for world peace.
Although it was not built into a program, similar simplifications were devised for various international uses.
Another version, Simplified English, exists, which is a controlled language originally developed for aerospace industry maintenance manuals.
It offers a carefully limited and standardised subset of English.
Simplified English has a lexicon of approved words and those words can only be used in certain ways.
For example, the word close can be used in the phrase "Close the door" but not "do not go close to the landing gear".
Esperanto
Esperanto is by far the most widely spoken constructed international auxiliary language in the world.
Its name derives from Doktoro Esperanto, the pseudonym under which L. L. Zamenhof published the first book detailing Esperanto, the Unua Libro, in 1887.
The word esperanto means 'one who hopes' in the language itself.
Zamenhof's goal was to create an easy and flexible language that would serve as a universal second language to foster peace and international understanding.
Esperanto has had continuous usage by a community estimated at between 100,000 and 2 million speakers for over a century.
By most estimates, there are approximately one thousand native speakers.
However, no country has adopted the language officially.
Today, Esperanto is employed in world travel, correspondence, cultural exchange, conventions, literature, language instruction, television, and radio broadcasting.
There is also an Esperanto Wikipedia, which contained over 100,000 articles as of June 2008.
There is evidence that learning Esperanto may provide a good foundation for learning languages in general.
Some state education systems offer basic instruction and elective courses in Esperanto.
Esperanto is also the language of instruction in one university, the Akademio Internacia de la Sciencoj in San Marino.
History
Esperanto was developed in the late 1870s and early 1880s by ophthalmologist Dr. Ludovic Lazarus Zamenhof, an Ashkenazi Jew from Bialystok, now in Poland and previously in the Polish-Lithuanian Commonwealth, but at the time part of the Russian Empire.
After some ten years of development, which Zamenhof spent translating literature into the language as well as writing original prose and verse, the first book of Esperanto grammar was published in Warsaw in July 1887.
The number of speakers grew rapidly over the next few decades, at first primarily in the Russian empire and Eastern Europe, then in Western Europe, the Americas, China, and Japan.
In the early years, speakers of Esperanto kept in contact primarily through correspondence and periodicals, but in 1905 the first world congress of Esperanto speakers was held in Boulogne-sur-Mer, France.
Since then world congresses have been held in different countries every year, except during the two World Wars.
Since the Second World War, they have been attended by an average of over 2000 and up to 6000 people.
Relation to 20th-century totalitarianism
As a potential vehicle for international understanding, Esperanto attracted the suspicion of many totalitarian states.
The situation was especially pronounced in Nazi Germany and in the Soviet Union under Joseph Stalin.
In Germany, there was additional motivation to persecute Esperanto because Zamenhof was a Jew.
In his work Mein Kampf, Hitler mentioned Esperanto as an example of a language that would be used by an International Jewish Conspiracy once they achieved world domination.
Esperantists were executed during the Holocaust, with Zamenhof's family in particular singled out for execution.
In the early years of the Soviet Union, Esperanto was given a measure of government support, and an officially recognized Soviet Esperanto Association came into being.
However, in 1937, Stalin reversed this policy.
He denounced Esperanto as "the language of spies" and had Esperantists executed.
The use of Esperanto remained illegal until 1956.
Official use
Esperanto has never been an official language of any recognized country.
However, there were plans at the beginning of the 20th century to establish Neutral Moresnet as the world's first Esperanto state.
In China, there was talk in some circles after the 1911 Xinhai Revolution about officially replacing Chinese with Esperanto as a means to dramatically bring the country into the twentieth century, though this policy proved untenable.
In the summer of 1924, the American Radio Relay League adopted Esperanto as its official international auxiliary language, and hoped that the language would be used by radio amateurs in international communications, but its actual use for radio communications was negligible.
In addition, the self-proclaimed artificial island micronation of Rose Island used Esperanto as its official language in 1968.
Esperanto is the working language of several non-profit international organizations such as the Sennacieca Asocio Tutmonda, but most others are specifically Esperanto organizations.
The largest of these, the World Esperanto Association, has an official consultative relationship with the United Nations and UNESCO.
The U.S. Army has published military phrasebooks in Esperanto, to be used in wargames by mock enemy forces.
Esperanto is also the first language of teaching and administration of the International Academy of Sciences San Marino, which is sometimes called an "Esperanto University".
Linguistic properties
Classification
As a constructed language, Esperanto is not genealogically related to any ethnic language.
It has been described as "a language lexically predominantly Romanic, morphologically intensively agglutinative and to a certain degree isolating in character".
The phonology, grammar, vocabulary, and semantics are based on the western Indo-European languages.
The phonemic inventory is essentially Slavic, as is much of the semantics, while the vocabulary derives primarily from the Romance languages, with a lesser contribution from the Germanic languages.
Pragmatics and other aspects of the language not specified by Zamenhof's original documents were influenced by the native languages of early speakers, primarily Russian, Polish, German, and French.
Typologically, Esperanto has prepositions and a pragmatic word order that by default is Subject Verb Object and Adjective Noun.
New words are formed through extensive prefixing and suffixing.
Writing system
Esperanto is written with a modified version of the Latin alphabet, including six letters with diacritics: ĉ, ĝ, ĥ, ĵ, ŝ and ŭ (that is, c, g, h, j, s circumflex, and u breve).
The alphabet does not include the letters q, w, x, or y except in unassimilated foreign names.
The 28-letter alphabet is: a b c ĉ d e f g ĝ h ĥ i j ĵ k l m n o p r s ŝ t u ŭ v z
All letters are pronounced approximately as in the IPA, with the exception of c, which is pronounced [ts], and the accented letters: ĉ [tʃ], ĝ [dʒ], ĥ [x], ĵ [ʒ], ŝ [ʃ] and ŭ [u̯].
Two ASCII-compatible writing conventions are in use.
These substitute digraphs for the accented letters.
The original "h-convention" (ch, gh, hh, jh, sh, u) is based on English 'ch' and 'sh', while a more recent "x-convention" (cx, gx, hx, jx, sx, ux) is useful for alphabetic word sorting on a computer (cx comes correctly after cu, sx after sv, etc.) as well as for simple conversion back into the standard orthography.
Another scheme represents the accented letters with a caret (^) placed after or before the base letter, as for example: c^ or ^c.
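The conversion from the x-convention back into the standard orthography can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names X_DIGRAPHS and from_x_convention are invented for this example, and the sketch relies on x not occurring in ordinary Esperanto words, so an unassimilated foreign name that contains x after one of these letters would be altered.

```python
# Hypothetical helper table: x-convention digraphs and their accented letters.
X_DIGRAPHS = {"cx": "ĉ", "gx": "ĝ", "hx": "ĥ", "jx": "ĵ", "sx": "ŝ", "ux": "ŭ"}


def from_x_convention(text: str) -> str:
    """Replace x-convention digraphs (cx, gx, ...) with accented letters."""
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        accented = X_DIGRAPHS.get(pair.lower())
        if accented is not None:
            # Keep the capitalisation of the base letter (Cx becomes Ĉ).
            out.append(accented.upper() if pair[0].isupper() else accented)
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


print(from_x_convention("Cxu vi parolas Esperanton?"))
# Plain character-code sorting of x-convention text keeps cx after cu.
print(sorted(["cxu", "cu", "cxar", "cedi"]))
```

The last line illustrates the sorting property mentioned above: because x sorts after every letter that can follow c in the list, the cx words end up after the plain c words without any special collation rules.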
Phonology
Esperanto has 22 consonants, 5 vowels, and two semivowels, which combine with the vowels to form 6 diphthongs.
(The consonant /j/ and the semivowel /i̯/ are both written j.)
Tone is not used to distinguish meanings of words.
Stress is always on the penultimate vowel, unless a final vowel o is elided, a practice which occurs mostly in poetry.
For example, familio "family" is stressed on its second i (fa-mi-LI-o), and when found without the final o, famili’, the stress does not shift (fa-mi-LI).
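The stress rule can be expressed as a short procedure. The Python below is a minimal sketch, assuming that only a, e, i, o and u count as vowels (the semivowels j and ŭ do not) and that an elided final o is marked with an apostrophe; the function names are invented for this illustration.

```python
VOWELS = "aeiou"  # the semivowels j and ŭ are deliberately not included


def stressed_vowel_index(word: str) -> int:
    """Return the position of the stressed vowel in an Esperanto word."""
    lowered = word.lower()
    positions = [i for i, ch in enumerate(lowered) if ch in VOWELS]
    if not positions:
        raise ValueError(f"no vowel in {word!r}")
    # A trailing apostrophe marks an elided final o: the stress stays on the
    # vowel that was penultimate before elision, which is now the last one.
    if lowered.endswith(("'", "’")) or len(positions) == 1:
        return positions[-1]
    return positions[-2]


def mark_stress(word: str) -> str:
    """Upper-case the stressed vowel so that it is easy to see."""
    i = stressed_vowel_index(word)
    return word[:i] + word[i].upper() + word[i + 1:]


print(mark_stress("familio"))
print(mark_stress("famili’"))
```

Both calls mark the same vowel, the second i, which mirrors the statement that elision of the final o does not move the stress.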
Consonants
The 22 consonants are:
The sound /r/ is usually rolled, but may be tapped.
The /v/ has a normative pronunciation like an English v, but is sometimes somewhere between a v and a w, depending on the language background of the speaker.
A semivowel /u̯/ normally occurs only in diphthongs after the vowels /a/ and /e/, not as a consonant /w/.
Common, if debated, assimilation includes the pronunciation of nk as [ŋk], as in English sink, and of kz as [ɡz], like the x in English example.
A large number of consonant clusters can occur, up to three in initial position and four in medial position, as in instrui "to teach".
Final clusters are uncommon except in foreign names, poetic elision of final o, and a very few basic words such as cent "hundred" and post "after".
Vowels
Esperanto has the five cardinal vowels of Spanish, Swahili, and Modern Greek.
There are six falling diphthongs: uj, oj, ej, aj, aŭ, eŭ.
With only five vowels, a good deal of variation is tolerated.
For instance, e commonly ranges from [e] (French é) to [ɛ] (French è).
The details often depend on the speaker's native language.
A glottal stop may occur between adjacent vowels in some people's speech, especially when the two vowels are the same, as in heroo "hero" and praavo "great-grandfather".
Grammar
Esperanto words are derived by stringing together prefixes, roots, and suffixes.
This process is regular, so that people can create new words as they speak and be understood.
Compound words are formed with a modifier-first, head-final order, the same order as in English (compare "birdsong" with "songbird").
The different parts of speech are marked by their own suffixes: all common nouns end in -o, all adjectives in -a, all derived adverbs in -e, and all verbs in one of six tense and mood suffixes, such as present tense -as.
Plural nouns end in -oj (pronounced "oy"), whereas direct objects end in -on.
Plural direct objects end with the combination -ojn (pronounced to rhyme with "coin"): That is, -o for a noun, plus -j for plural, plus -n for direct object.
Adjectives agree with their nouns; their endings are plural -aj (pronounced "eye"), direct-object -an, and plural direct-object -ajn (pronounced to rhyme with "fine").
The suffix -n is used to indicate the goal of movement and a few other things, in addition to the direct object.
See Esperanto grammar for details.
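The noun and adjective endings described above stack in a fixed order, which a small Python sketch can make concrete. The function names inflect and noun_phrase are invented for this illustration, and the sketch covers only the plural -j and the direct-object -n on words that already end in -o or -a.

```python
def inflect(word: str, plural: bool = False, direct_object: bool = False) -> str:
    """Add -j (plural) and then -n (direct object) to a noun or adjective."""
    if not word.endswith(("o", "a")):
        raise ValueError("expected a noun in -o or an adjective in -a")
    return word + ("j" if plural else "") + ("n" if direct_object else "")


def noun_phrase(adjective: str, noun: str, plural: bool = False,
                direct_object: bool = False) -> str:
    """Inflect an adjective and its noun together so that they agree."""
    return " ".join(inflect(w, plural, direct_object) for w in (adjective, noun))


print(inflect("libro", plural=True))                      # libroj
print(noun_phrase("bela", "libro", direct_object=True))   # belan libron
print(noun_phrase("bela", "libro", True, True))           # belajn librojn
```

Because the adjective receives exactly the same endings as its noun, agreement falls out of applying one function to both words.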
The six verb inflections consist of three tenses and three moods.
They are present tense -as, future tense -os, past tense -is, infinitive mood -i, conditional mood -us, and jussive mood -u (used for wishes and commands).
Verbs are not marked for person or number.
For instance: kanti "to sing"; mi kantas "I sing"; mi kantis "I sang"; mi kantos "I will sing"; li kantas "he sings"; vi kantas "you sing".
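Since verbs are not marked for person or number, conjugation amounts to swapping one ending. The following Python is a minimal sketch of that idea; the table name VERB_ENDINGS, the form labels and the function conjugate are chosen for this example only.

```python
# The six verb endings listed above, keyed by labels chosen for this sketch.
VERB_ENDINGS = {
    "present": "as",
    "future": "os",
    "past": "is",
    "infinitive": "i",
    "conditional": "us",
    "jussive": "u",
}


def conjugate(infinitive: str, form: str) -> str:
    """Replace the infinitive ending -i with the ending for the given form."""
    if not infinitive.endswith("i"):
        raise ValueError("expected an infinitive ending in -i")
    return infinitive[:-1] + VERB_ENDINGS[form]


for pronoun in ("mi", "li", "vi"):
    # The verb form is identical whatever the subject.
    print(pronoun, conjugate("kanti", "present"))
print("mi", conjugate("kanti", "past"))
print("mi", conjugate("kanti", "future"))
```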
Word order is comparatively free: Adjectives may precede or follow nouns, and subjects, verbs and objects (marked by the suffix -n) may occur in any order.
However, the article la "the" and demonstratives such as tiu "this, that" almost always come before the noun, and a preposition such as ĉe "at" must come before it.
Similarly, the negative ne "not" and conjunctions such as kaj "both, and" and ke "that" must precede the phrase or clause they introduce.
In copular (A = B) clauses, word order is just as important as it is in English clauses like "people are dogs" vs. "dogs are people".
Correlatives
A correlative is a word used to ask or answer a question of who, where, what, when, or how.
Correlatives in Esperanto are set out in a systematic manner that correlates a basic idea (quantity, manner, time, etc.) to a function (questioning, indicating, negating, etc.)
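The systematic layout can be pictured as a grid in which each correlative is a beginning (the function) joined to an ending (the basic idea). The Python below is a sketch of that grid; the English labels are rough glosses chosen for this illustration rather than official terminology, and the names BEGINNINGS, ENDINGS and correlative are likewise invented here.

```python
# Beginnings express the function; the labels are informal glosses.
BEGINNINGS = {
    "ki": "question",
    "ti": "indication",
    "i": "indefinite",
    "ĉi": "universal",
    "neni": "negation",
}

# Endings express the basic idea; again the labels are informal glosses.
ENDINGS = {
    "o": "thing",
    "u": "individual",
    "a": "kind",
    "e": "place",
    "am": "time",
    "al": "reason",
    "el": "manner",
    "om": "quantity",
    "es": "possession",
}


def correlative(beginning: str, ending: str) -> str:
    """Join a beginning and an ending into one correlative word."""
    if beginning not in BEGINNINGS or ending not in ENDINGS:
        raise KeyError((beginning, ending))
    return beginning + ending


# The full grid: every beginning combined with every ending.
table = {(b, e): correlative(b, e) for b in BEGINNINGS for e in ENDINGS}

print(correlative("ki", "o"), correlative("ti", "o"), correlative("i", "o"))
print(len(table))
```

Looking a word up by function and idea, for instance ("ki", "om") for a question about quantity, gives kiom, the word used for "how much".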
Examples:
Kio estas tio? "What is that?"
Kioma estas la horo? "What time is it?"
Note the use of kioma rather than kiu (Kiu estas la horo? "which is the hour?"), because the question asks for the position of the hour in the sequence on the clock.
Io falis el la ŝranko. "Something fell out of the cupboard."
Homoj tiaj kiel mi ne konadas timon. "Men such as me know no fear."
Correlatives are declined if the case demands it:
Vi devas elekti ian vorton pli simplan. "You should choose a (some kind of) simpler word."
Ia receives -n because it's part of the direct object.
Kian libron vi volas? "What sort of book do you want?"
Contrast this with Kiun libron vi volas? "Which book do you want?"
Vocabulary
The core vocabulary of Esperanto was defined by Lingvo internacia, published by Zamenhof in 1887.
It comprised 900 roots, which could be expanded into tens of thousands of words with prefixes, suffixes, and compounding.
In 1894, Zamenhof published the first Esperanto dictionary, Universala Vortaro, with a larger set of roots.
However, the rules of the language allowed speakers to borrow new roots as needed, recommending only that they look for the most international forms, and then derive related meanings from these.
Since then, many words have been borrowed, primarily but not solely from the Western European languages.
Not all proposed borrowings catch on, but many do, especially technical and scientific terms.
Terms for everyday use, on the other hand, are more likely to be derived from existing roots—for example komputilo (a computer) from komputi (to compute) plus the suffix -ilo (tool)—or to be covered by extending the meanings of existing words (for example muso (a mouse), as in English, now also means a computer input device).
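Derivation from existing roots is regular enough to be written as simple concatenation. The Python below is an illustrative sketch only: the function names root_of and derive are invented here, and the sketch does not attempt to capture the meaning of the result, which in some compounds has to be learned separately.

```python
def root_of(infinitive: str) -> str:
    """Strip the infinitive ending -i to obtain the bare root."""
    if not infinitive.endswith("i"):
        raise ValueError("expected an infinitive ending in -i")
    return infinitive[:-1]


def derive(root: str, *affixes: str, ending: str) -> str:
    """String together a root, any suffixes, and a part-of-speech ending."""
    return root + "".join(affixes) + ending


# komputi "to compute" + -il- (tool) + -o (noun) gives komputilo "computer".
print(derive(root_of("komputi"), "il", ending="o"))
# vort- "word" + -ar- (collection) + -o (noun) gives vortaro "dictionary".
print(derive("vort", "ar", ending="o"))
```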
There are frequent debates among Esperanto speakers about whether a particular borrowing is justified or whether the need can be met by deriving from or extending the meaning of existing words.
In addition to the root words and the rules for combining them, a learner of Esperanto must memorize some idiomatic compounds that are not entirely straightforward.
For example, eldoni, literally "to give out", is used for "to publish" (a calque of words in several European languages with the same derivation), and vortaro, literally "a collection of words", means "a glossary" or "a dictionary".
Such forms are modeled after usage in some European languages, and speakers of other languages may find them illogical.
Fossilized derivations inherited from Esperanto's source languages may be similarly obscure, such as the opaque connection the root word centralo "power station" has with centro "center".
Compounds with -um- are overtly arbitrary, and must be learned individually, as -um- has no defined meaning.
It turns dekstren "to the right" into dekstrumen "clockwise", and komuna "common/shared" into komunumo "community", for example.
Nevertheless, there are not nearly as many idiomatic or slang words in Esperanto as in ethnic languages, as these tend to make international communication difficult, working against Esperanto's main goal.
Useful phrases
Here are some useful Esperanto phrases:
Hello: Saluton
What is your name?: Kiel vi nomiĝas?
My name is...: Mi nomiĝas...
How much (is it/are they)?: Kiom (estas)?
Here you are: Jen
Do you speak Esperanto?: Ĉu vi parolas Esperanton?
I do not understand you: Mi ne komprenas vin
I like this one: Ĉi tiu plaĉas al mi or Mi ŝatas tiun ĉi
Thank you: Dankon
You're welcome: Ne dankinde
Please: Bonvolu or mi petas
Here's to your health: Je via sano
Bless you!/Gesundheit!: Sanon!
Congratulations!: Gratulon!
Okay: Bone or Ĝuste
Yes: Jes
No: Ne
It is a nice day: Estas bela tago
I love you: Mi amas vin
Goodbye: Ĝis (la) (revido)
One beer, please: Unu bieron, mi petas.
What is that?: Kio estas tio?
That is...: Tio estas...
How are you?: Kiel vi (fartas)?
Good morning!: Bonan matenon!
Good evening!: Bonan vesperon!
Good night!: Bonan nokton!
Peace!: Pacon!
Sample text
The following short extract gives an idea of the character of Esperanto.
(Pronunciation is covered above.
The main point for English speakers to remember is that the letter 'J' has the sound of the letter 'Y' in English.)
Esperanto text
En multaj lokoj de Ĉinio estis temploj de drako-reĝo. Dum trosekeco oni preĝis en la temploj, ke la drako-reĝo donu pluvon al la homa mondo.
Tiam drako estis simbolo de la supernatura estaĵo. Kaj pli poste, ĝi fariĝis prapatro de la plej altaj regantoj kaj simbolis la absolutan aŭtoritaton de feŭda imperiestro.
La imperiestro pretendis, ke li estas filo de la drako. Ĉiuj liaj vivbezonaĵoj portis la nomon drako kaj estis ornamitaj per diversaj drakofiguroj.
Nun ĉie en Ĉinio videblas drako-ornamentaĵoj kaj cirkulas legendoj pri drakoj.
English Translation:
In many places in China there were temples of the dragon king.
During times of drought, people prayed in the temples, that the dragon king would give rain to the human world.
At that time the dragon was a symbol of the supernatural.
Later on, it became the ancestor of the highest rulers and symbolised the absolute authority of the feudal emperor.
The emperor claimed to be the son of the dragon.
All of his personal possessions carried the name dragon and were decorated with various dragon figures.
Now dragon decorations can be seen everywhere in China, and legends about dragons circulate.
Education
The majority of Esperanto speakers learn the language through self-directed study, online tutorials, and correspondence courses taught by volunteers.
In more recent years, teaching websites like lernu! have become popular.
Esperanto instruction is occasionally available at schools, for example through a pilot project involving four primary schools under the supervision of the University of Manchester, and, by one count, at 69 universities.
However, outside of China and Hungary, these mostly involve informal arrangements rather than dedicated departments or state sponsorship.
Eötvös Loránd University in Budapest had a department of Interlinguistics and Esperanto from 1966 to 2004, after which time instruction moved to vocational colleges; there are state examinations for Esperanto instructors.
Various educators have estimated that Esperanto can be learned in anywhere from one quarter to one twentieth the amount of time required for other languages.
Some argue, however, that this is only true for native speakers of Western European languages.
Claude Piron, a psychologist formerly at the University of Geneva and Chinese-English-Russian-Spanish translator for the United Nations, argued that Esperanto is far more "brain friendly" than many ethnic languages.
"Esperanto relies entirely on innate reflexes [and] differs from all other languages in that you can always trust your natural tendency to generalize patterns. [...]
The same neuropsychological law [— called by] Jean Piaget generalizing assimilation — applies to word formation as well as to grammar."
Language acquisition
Four primary schools in Britain, with some 230 pupils, are currently following a course in "propedeutic Esperanto", that is, instruction in Esperanto to raise language awareness and accelerate subsequent learning of foreign languages, under the supervision of the University of Manchester.
Several studies demonstrate that studying Esperanto before another foreign language speeds and improves learning the second language to a greater extent than other languages which have been investigated.
This appears to be because learning subsequent foreign languages is easier than learning one's first, while the use of a grammatically simple and culturally flexible auxiliary language like Esperanto lessens the first-language learning hurdle.
In one study, a group of European secondary school students studied Esperanto for one year, then French for three years, and ended up with a significantly better command of French than a control group, who studied French for all four years.
Similar results were found when the course of study was reduced to two years, of which six months was spent learning Esperanto.
Results are not yet available from a study in Australia to see if similar benefits would occur for learning East Asian languages, but the pupils taking Esperanto did better and enjoyed the subject more than those taking other languages.
Community
Geography and demography
Esperanto speakers are more numerous in Europe and East Asia than in the Americas, Africa, and Oceania, and more numerous in urban than in rural areas.
Esperanto is particularly prevalent in the northern and eastern countries of Europe; in China, Korea, Japan, and Iran within Asia; in Brazil, Argentina, and Mexico in the Americas; and in Togo in Africa.
Number of speakers
An estimate of the number of Esperanto speakers was made by the late Sidney S. Culbert, a retired psychology professor at the University of Washington and a longtime Esperantist, who tracked down and tested Esperanto speakers in sample areas in dozens of countries over a period of twenty years.
Culbert concluded that between one and two million people speak Esperanto at Foreign Service Level 3, "professionally proficient" (able to communicate moderately complex ideas without hesitation, and to follow speeches, radio broadcasts, etc.).
Culbert's estimate was not made for Esperanto alone, but formed part of his listing of estimates for all languages of over 1 million speakers, published annually in the World Almanac and Book of Facts.
Culbert's most detailed account of his methodology is found in a 1989 letter to David Wolff.
Since Culbert never published detailed intermediate results for particular countries and regions, it is difficult to independently gauge the accuracy of his results.
In the Almanac, his estimates for numbers of language speakers were rounded to the nearest million, thus the number for Esperanto speakers is shown as 2 million.
This latter figure appears in Ethnologue.
Assuming that this figure is accurate, that means that about 0.03% of the world's population speaks the language.
This falls short of Zamenhof's goal of a universal language, but it represents a level of popularity unmatched by any other constructed language.
Marcus Sikosek (now Ziko van Dijk) has challenged this figure of 1.6 million as exaggerated.
He estimated that even if Esperanto speakers were evenly distributed, assuming one million Esperanto speakers worldwide would lead one to expect about 180 in the city of Cologne.
Van Dijk finds only 30 fluent speakers in that city, and similarly smaller than expected figures in several other places thought to have a larger-than-average concentration of Esperanto speakers.
He also notes that there are a total of about 20,000 members of the various Esperanto organizations (other estimates are higher).
Though there are undoubtedly many Esperanto speakers who are not members of any Esperanto organization, he thinks it unlikely that there are fifty times more speakers than organization members.
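The arithmetic behind Van Dijk's argument can be sketched in a few lines of Python. The sketch uses only the figures quoted above; the variable names are illustrative, and the extrapolation from a single city to a worldwide total is a deliberately naive assumption made here for illustration, not an estimate attributed to Van Dijk.

```python
# Figures quoted in the text above.
ASSUMED_SPEAKERS = 1_000_000   # worldwide figure being tested
EXPECTED_IN_COLOGNE = 180      # expected in Cologne under even distribution
FOUND_IN_COLOGNE = 30          # fluent speakers actually found there
ORGANIZATION_MEMBERS = 20_000  # members of Esperanto organizations

# Naive extrapolation (an assumption of this sketch): scale the assumed
# total by the observed-to-expected ratio found in Cologne.
implied_speakers = ASSUMED_SPEAKERS * FOUND_IN_COLOGNE / EXPECTED_IN_COLOGNE

# Speakers per organization member implied by the one-million figure.
speakers_per_member = ASSUMED_SPEAKERS / ORGANIZATION_MEMBERS

print(round(implied_speakers))  # 166667
print(speakers_per_member)      # 50.0
```

The second figure is the "fifty times" ratio mentioned above; the first shows how far a single-city count, taken at face value, would pull the total below one million.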
Finnish linguist Jouko Lindstedt, an expert on native-born Esperanto speakers, presented the following scheme to show the overall proportions of language capabilities within the Esperanto community:
1,000 have Esperanto as their native language
10,000 speak it fluently
100,000 can use it actively
1,000,000 understand a large amount passively
10,000,000 have studied it to some extent at some time.
In the absence of Dr. Culbert's detailed sampling data, or any other census data, it is impossible to state the number of speakers with certainty.
Few observers, probably, would challenge the following statement from the website of the World Esperanto Association:
Numbers of textbooks sold and membership of local societies put the number of people with some knowledge of the language in the hundreds of thousands and possibly millions.
Native speakers
Ethnologue reports estimates that there are 200 to 2000 native Esperanto speakers (denaskuloj), who have learned the language from birth from their Esperanto-speaking parents.
This usually happens when Esperanto is the chief or only common language in an international family, but sometimes in a family of devoted Esperantists.
The most famous native speaker of Esperanto is businessman George Soros.
Also notable is young Holocaust victim Petr Ginz, whose drawing of the planet Earth as viewed from the moon was carried aboard the Space Shuttle Columbia in 2003 (STS-107).
Culture
Esperanto speakers can access an international culture, including a large body of original as well as translated literature.
There are over 25,000 Esperanto books, both originals and translations, as well as several regularly distributed Esperanto magazines.
Esperanto speakers can use the language to find free accommodation with Esperantists in 92 countries through the Pasporta Servo, or to develop pen pal friendships abroad through the Esperanto Pen Pal Service.
Every year, 1,500-3,000 Esperanto speakers meet for the World Congress of Esperanto (Universala Kongreso de Esperanto).
The European Esperanto Union (Eŭropa Esperanto-Unio) brings together the national Esperanto associations of the EU member states and holds congresses every two years.
The most recent was in Maribor, Slovenia, in July-August 2007.
It attracted 256 delegates from 28 countries, including 2 members of the European Parliament, Ms. Małgorzata Handzlik of Poland and Ms. Ljudmila Novak of Slovenia.
Historically, much Esperanto music has been in various folk traditions; Kaj Tiel Plu is one example.
In recent decades, more rock and other modern genres have appeared, an example being the Swedish band Persone.
There are also shared traditions, such as Zamenhof Day, and shared behaviour patterns.
Esperantists speak primarily in Esperanto at international Esperanto meetings.
Detractors of Esperanto occasionally criticize it as "having no culture".
Proponents, such as Prof. Humphrey Tonkin of the University of Hartford, observe that Esperanto is "culturally neutral by design, as it was intended to be a facilitator between cultures, not to be the carrier of any one national culture."
The late Scottish Esperanto author William Auld wrote extensively on the subject, arguing that Esperanto is "the expression of a common human culture, unencumbered by national frontiers.
Thus it is considered a culture on its own."
Others point to Esperanto's potential for strengthening a common European identity, as it combines features of several European languages.
In popular culture
Esperanto has been used in a number of films and novels.
Typically, this is done either to add the exotic flavour of a foreign language without representing any particular ethnicity, or to avoid going to the trouble of inventing a new language.
The Charlie Chaplin film The Great Dictator (1940) showed Jewish ghetto shops designated in Esperanto, each with the general Esperanto suffix -ejo (meaning "place for..."), in order to convey the atmosphere of some 'foreign' East European country without referencing any particular East European language.
Two full-length feature films have been produced with dialogue entirely in Esperanto: Angoroj, in 1964, and Incubus, a 1965 B-movie horror film.
Canadian actor William Shatner learned Esperanto to a limited level so that he could star in Incubus.
Other amateur productions have been made, such as a dramatisation of the novel Gerda Malaperis (Gerda Has Disappeared).
A number of "mainstream" films in national languages have used Esperanto in some way, such as Gattaca (1997), in which Esperanto can be overheard on the public address system.
In the 1994 film Street Fighter, Esperanto is the native language of the fictional country of Shadaloo, and in a barracks scene the soldiers of villain M. Bison sing a rousing Russian Army-style chorus, the "Bison Troopers Marching Song", in the language.
Esperanto is also spoken and appears on signs in the film Blade: Trinity.
In the British comedy Red Dwarf, Arnold Rimmer is seen attempting to learn Esperanto in a number of early episodes, including Kryten.
In the first season, signs on the titular spacecraft are in both English and Esperanto.
Esperanto is used as the universal language in the far future of Harry Harrison's Stainless Steel Rat and Deathworld stories.
In a 1969 guest appearance on The Tonight Show, Jay Silverheels of The Lone Ranger fame appeared in character as Tonto for a comedy sketch with Johnny Carson, and claimed Esperanto skills as he sought new employment.
The sketch ended with a statement of his ideal situation: "Tonto, to Toronto, for Esperanto, and pronto!"
In the Danny Phantom episode "Public Enemies", Danny, Tucker, and Sam come across a ghost wolf who speaks Esperanto, but only Tucker can understand it at first.
In science
In 1921 the French Academy of Sciences recommended using Esperanto for international scientific communication.
A few scientists and mathematicians, such as Maurice Fréchet (mathematics), John C. Wells (linguistics), Helmar Frank (pedagogy and cybernetics), and Nobel laureate Reinhard Selten (economics) have published part of their work in Esperanto.
Frank and Selten were among the founders of the International Academy of Sciences in San Marino, sometimes called the "Esperanto University", where Esperanto is the primary language of teaching and administration.
Goals of the movement
Zamenhof's intention was to create an easy-to-learn language to foster international understanding.
It was to serve as an international auxiliary language, that is, as a universal second language, not to replace ethnic languages.
This goal was widely shared among Esperanto speakers in the early decades of the movement.
Later, Esperanto speakers began to see the language and the culture that had grown up around it as ends in themselves, even if Esperanto is never adopted by the United Nations or other international organizations.
Those Esperanto speakers who want to see Esperanto adopted officially or on a large scale worldwide are commonly called finvenkistoj, from fina venko, meaning "final victory", or pracelistoj, from pracelo, meaning "original goal".
Those who focus on the intrinsic value of the language are commonly called raŭmistoj, from Rauma, Finland, where a declaration on the near-term unlikelihood of the "fina venko" and the value of Esperanto culture was made at the International Youth Congress in 1980.
These categories are, however, not mutually exclusive.
The Prague Manifesto (1996) presents the views of the mainstream of the Esperanto movement and of its main organisation, the World Esperanto Association (UEA).
Symbols and flags
In 1893, C. Rjabinis and P. Deullin designed and manufactured a lapel pin for Esperantists to identify each other.
The design was a circular pin with a white background and a five pointed green star.
The theme of the design was the hope of the five continents being united by a common language.
The earliest flag, and the one most commonly used today, features a green five-pointed star against a white canton, upon a field of green.
It was proposed to Zamenhof by Irishman Richard Geoghegan, author of the first Esperanto textbook for English speakers, in 1887.
In 1905, delegates to the first conference of Esperantists at Boulogne-sur-Mer unanimously approved a version that differed from the modern flag only by the superimposition of an "E" over the green star.
Other variants include that for Christian Esperantists, with a white Christian cross superimposed upon the green star, and that for Leftists, with the color of the field changed from green to red.
In 1987, a second flag design was chosen in a contest organized by the UEA celebrating the first centennial of the language.
It featured a white background with two stylised curved "E"s facing each other.
Dubbed the "jubilea simbolo" (jubilee symbol), it attracted criticism from some Esperantists, who dubbed it the "melono" (melon) because of the design's elliptical shape.
It is still in use, though to a lesser degree than the traditional symbol, known as the "verda stelo" (green star).
Religion
Esperanto has played an important role in several religions, such as Oomoto from Japan and the Bahá'í Faith from Iran, and has been encouraged by others.
Oomoto
The Oomoto religion encourages the use of Esperanto among its followers and includes Zamenhof as one of its deified spirits.
Bahá'í Faith
The Bahá'í Faith encourages the use of an auxiliary international language.
While endorsing no specific language, some Bahá'ís see Esperanto as having great potential in this role.
Lidja Zamenhof, the daughter of Esperanto founder L. L. Zamenhof, became a Bahá'í.
Various volumes of Bahá'í literature and other Bahá'í books have been translated into Esperanto.
Spiritism
Esperanto is also actively promoted, at least in Brazil, by followers of Spiritism.
The Brazilian Spiritist Federation publishes Esperanto coursebooks and translations of Spiritism's basic books, and encourages Spiritists to become Esperantists.
Bible translations
The first translation of the Bible into Esperanto was a translation of the Tanakh or Old Testament done by L. L. Zamenhof.
The translation was reviewed and compared with translations into other languages by a group of British clergy and scholars before it was published by the British and Foreign Bible Society in 1910.
In 1926 this was published along with a New Testament translation, in an edition commonly called the "Londona Biblio".
In the 1960s, the Internacia Asocio de Bibliistoj kaj Orientalistoj tried to organize a new, ecumenical Esperanto Bible version.
Since then, the Dutch Lutheran pastor Gerrit Berveling has translated the Deuterocanonical or apocryphal books in addition to new translations of the Gospels, some of the New Testament epistles, and some books of the Tanakh or Old Testament.
These have been published in various separate booklets, or serialized in Dia Regno, but the Deuterocanonical books have appeared in recent editions of the Londona Biblio.
Christianity
Two Roman Catholic popes, John Paul II and Benedict XVI, have regularly used Esperanto in their multilingual urbi et orbi blessings at Easter and Christmas each year since Easter 1994.
Christian Esperanto organizations include two that were formed early in the history of Esperanto, the International Union of Catholic Esperantists and the International Christian Esperantists League.
An issue of "The Friend" describes the activities of the Quaker Esperanto Society.
There are instances of Christian apologists and teachers who use Esperanto as a medium.
Nigerian Pastor Bayo Afolaranmi's "Spirita nutraĵo" (spiritual food) Yahoo mailing list, for example, has hosted weekly messages since 2003.
Chick Publications, publisher of Protestant fundamentalist themed evangelistic tracts, has published a number of comic book style tracts by Jack T. Chick translated into Esperanto, including "This Was Your Life!" ("Jen Via Tuto Vivo!").
Islam
Ayatollah Khomeini of Iran called on Muslims to learn Esperanto and praised its use as a medium for better understanding among peoples of different religious backgrounds.
After he suggested that Esperanto replace English as an international lingua franca, it began to be used in the seminaries of Qom.
An Esperanto translation of the Qur'an was published by the state shortly thereafter.
In 1981, Khomeini and the Iranian government began to oppose Esperanto after realising that followers of the Bahá'í Faith were interested in it.
Criticism
Esperanto was conceived as a language of international communication, more precisely as a universal second language.
Since publication, there has been debate over whether it is possible for Esperanto to attain this position, and whether it would be an improvement for international communication if it did.
There have been a number of attempts to reform the language, the best known of which is Ido, which resulted in a schism in the community beginning in 1907.
Since Esperanto is a planned language, there have been many, often passionate, criticisms of minor points which are too numerous to cover here, such as Zamenhof's choice of the word edzo over something like spozo for "husband, spouse", or his choice of the Classic Greek and Old Latin singular and plural endings -o, -oj, -a, -aj over their Medieval contractions -o, -i, -a, -e.
(Both these changes were adopted by the Ido reform, though Ido dispensed with adjectival agreement altogether.)
See the links below for examples of more general criticism.
The more common points include:
Esperanto has failed the expectations of its founder to become a universal second language.
Although many promoters of Esperanto stress the few successes it has had, the fact remains that well over a century since its publication, the portion of the world that speaks Esperanto, and the number of primary and secondary schools which teach it, remain minuscule.
It simply cannot compete with English in this regard.
The vocabulary and grammar are based on major European languages, and are not universal.
Often this criticism is specific to a few points such as adjectival agreement and the accusative case (generally such obvious details are all that reform projects suggest changing), but sometimes it is more general: Both the grammar and the 'international' vocabulary are difficult for many Asians, among others, and give an unfair advantage to speakers of European languages.
One attempt to address this issue is Lojban, which draws from the six populous languages Arabic, Chinese, English, Hindi, Russian, and Spanish, and whose grammar is designed for computer parsing.
The vocabulary, diacritic letters, and grammar are too dissimilar from the major Western European languages, and therefore Esperanto is not as easy as it could be for speakers of those languages to learn.
Attempts to address this issue include the younger planned languages Ido and Interlingua.
Esperanto phonology is unimaginatively provincial, being essentially Belorussian with regularized stress, leaving out only the nasal vowels, palatalized consonants, and /dz/.
For example, Esperanto has phonemes such as ĥ, ĵ, c, and eŭ, which are rare as distinct phonemes outside Europe.
(Note that none of these are found in initial position in English.)
Esperanto has no culture.
Although it has a large international literature, Esperanto does not encapsulate a specific culture.
Esperanto is culturally European.
This is due to the European derivation of its vocabulary, and more insidiously, its semantics; both infuse the language with a European world view.
The vocabulary is too large.
Rather than deriving new words from existing roots, large numbers of new roots are adopted into the language by people who think these roots are international, when in fact they are only European.
This makes the language much more difficult for non-Europeans than it needs to be.
Esperanto is sexist.
As in English, there is no neutral pronoun for s/he, and most kin terms and titles are masculine by default and only feminine when so specified.
There have been many attempts to address this issue, of which one of the better known is Riism.
Esperanto is, looks, or sounds artificial.
This criticism is primarily due to the letters with circumflex diacritics, which some find odd or cumbersome, and to the lack of fluent speakers: Few Esperantists have spent much time with fluent, let alone native, speakers, and many learn Esperanto relatively late in life, and so speak haltingly, which can create a negative impression among non-speakers.
Among fluent speakers, Esperanto sounds no more artificial than any other language.
Others claim that an artificial language will necessarily be deficient, due to its very nature, but the Hungarian Academy of Sciences has found that Esperanto fulfills all the requirements of a living language.
Modifications
Though Esperanto itself has changed little since the publication of the Fundamento de Esperanto (Foundation of Esperanto), a number of reform projects have been proposed over the years, starting with Zamenhof's proposals in 1894 and Ido in 1907.
Several later constructed languages, such as Fasile, were based on Esperanto.
In modern times, attempts have been made to eliminate perceived sexism in the language.
One example of this is Riism.
However, as Esperanto has become a living language, changes are as difficult to implement as in ethnic languages.
Formal grammar
In formal semantics, computer science and linguistics, a formal grammar (also called formation rules) is a precise description of a formal language – that is, of a set of strings over some alphabet.
In other words, a grammar describes which of the possible sequences of symbols (strings) in a language constitute valid words or statements in that language, but it does not describe their semantics (i.e. what they mean).
The branch of mathematics that is concerned with the properties of formal grammars and languages is called formal language theory.
A grammar is usually regarded as a means to generate all the valid strings of a language; it can also be used as the basis for a recognizer that determines for any given string whether it is grammatical (i.e. belongs to the language).
To describe such recognizers, formal language theory uses separate formalisms, known as automata.
A grammar can also be used to analyze the strings of a language – i.e. to describe their internal structure.
In computer science, this process is known as parsing.
The semantics of most languages is largely compositional, i.e. the meaning of their utterances is structured according to their syntax; therefore, the first step in describing the meaning of an utterance is to analyze it and look at its analyzed form (known as its parse tree in computer science, and as its deep structure in generative grammar).
Background
Formal language
A formal language is an organized set of symbols the essential feature of which is that it can be precisely defined in terms of just the shapes and locations of those symbols.
Such a language can be defined, then, without any reference to any meanings of any of its expressions; it can exist before any formal interpretation is assigned to it -- that is, before it has any meaning.
First order logic is expressed in some formal language.
A formal grammar determines which symbols and sets of symbols are formulas in a formal language.
Formal systems
A formal system (also called a logical calculus, or a logical system) consists of a formal language together with a deductive apparatus (also called a deductive system).
The deductive apparatus may consist of a set of transformation rules (also called inference rules) or a set of axioms, or have both.
A formal system is used to derive one expression from one or more other expressions.
Formal proofs
A formal proof is a sequence of well-formed formulas (wffs) of a formal language, the last one of which is a theorem of a formal system.
The theorem is a syntactic consequence of all the wffs preceding it in the proof.
For a wff to qualify as part of a proof, it must be the result of applying a rule of the deductive apparatus of some formal system to the previous wffs in the proof sequence.
Formal interpretations
An interpretation of a formal system is the assignment of meanings to the symbols, and truth-values to the sentences of a formal system.
The study of formal interpretations is called formal semantics.
Giving an interpretation is synonymous with constructing a model.
Formal grammars
A grammar mainly consists of a set of rules for transforming strings.
(If it only consisted of these rules, it would be a semi-Thue system.)
To generate a string in the language, one begins with a string consisting of only a single start symbol, and then successively applies the rules (any number of times, in any order) to rewrite this string.
The language consists of all the strings that can be generated in this manner.
Any particular sequence of legal choices taken during this rewriting process yields one particular string in the language.
If there are essentially different ways of generating the same single string, then the grammar is said to be ambiguous.
For example, assume the alphabet consists of $a$ and $b$, the start symbol is $S$ and we have the following rules:
1. $S \rightarrow aSb$
2. $S \rightarrow ba$
then we start with $S$, and can choose a rule to apply to it.
If we choose rule 1, we obtain the string $aSb$.
If we choose rule 1 again, we replace $S$ with $aSb$ and obtain the string $aaSbb$.
This process can be repeated at will until all occurrences of $S$ are removed, and only symbols from the alphabet remain (i.e., $a$ and $b$).
For example, if we now choose rule 2, we replace $S$ with $ba$ and obtain the string $aababb$, and are done.
We can write this series of choices more briefly, using symbols: $S \Rightarrow aSb \Rightarrow aaSbb \Rightarrow aababb$.
The language of the grammar is the set of all the strings that can be generated using this process: $\{ba, abab, aababb, aaababbb, \ldots\}$.
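The rewriting steps described above can be replayed in a short Python sketch. It assumes that the two rules are $S \rightarrow aSb$ and $S \rightarrow ba$; the names RULES, apply_rule, and derive are illustrative and not part of any standard notation.

```python
# Rules assumed for this sketch: 1. S -> aSb, 2. S -> ba
RULES = {1: ("S", "aSb"), 2: ("S", "ba")}


def apply_rule(string, rule_number):
    """Rewrite the leftmost occurrence of the rule's left-hand side."""
    left, right = RULES[rule_number]
    if left not in string:
        raise ValueError(f"rule {rule_number} does not apply to {string!r}")
    return string.replace(left, right, 1)


def derive(choices, start="S"):
    """Return every intermediate string produced by a sequence of rule choices."""
    steps = [start]
    for rule_number in choices:
        steps.append(apply_rule(steps[-1], rule_number))
    return steps


# Rule 1 twice, then rule 2, as in the walkthrough.
print(" => ".join(derive([1, 1, 2])))  # S => aSb => aaSbb => aababb
```

Because each string here contains at most one $S$, choosing a rule number fully determines the next string; in grammars with several nonterminals a choice of position would also be needed.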
Formal definition
In the classic formalization of generative grammars first proposed by Noam Chomsky in the 1950s, a grammar G consists of the following components:
A finite set $N$ of nonterminal symbols.
A finite set $\Sigma$ of terminal symbols that is disjoint from $N$.
A finite set $P$ of production rules, each of the form
$(\Sigma \cup N)^{*} N (\Sigma \cup N)^{*} \rightarrow (\Sigma \cup N)^{*}$
where ${}^{*}$ is the Kleene star operator and $\cup$ denotes set union.
That is, each production rule maps from one string of symbols to another, where the first string contains at least one nonterminal symbol.
When the second string is the empty string, that is, when it contains no symbols at all, it is often denoted by a special symbol, often $\lambda$, $e$, or $\epsilon$, to avoid confusion.
A distinguished symbol $S \in N$ that is the start symbol.
A grammar is formally defined as the ordered 4-tuple $(N, \Sigma, P, S)$.
Such a formal grammar is often called a rewriting system or a phrase structure grammar in the literature.
The operation of a grammar can be defined in terms of relations on strings:
Given a grammar $G = (N, \Sigma, P, S)$, the binary relation $\Rightarrow_G$ (pronounced as "G derives in one step") on strings in $(\Sigma \cup N)^{*}$ is defined by:
$x \Rightarrow_G y \iff \exists u, v, p, q \in (\Sigma \cup N)^{*} : (x = upv) \wedge (p \rightarrow q \in P) \wedge (y = uqv)$
the relation $\Rightarrow_G^{*}$ (pronounced as "G derives in zero or more steps") is defined as the reflexive transitive closure of $\Rightarrow_G$
the language of $G$, denoted as $L(G)$, is defined as all those strings over $\Sigma$ that can be generated by starting with the start symbol $S$ and then applying the production rules in $P$ until no more nonterminal symbols are present; that is, the set $\{ w \in \Sigma^{*} \mid S \Rightarrow_G^{*} w \}$.
Note that the grammar $G = (N, \Sigma, P, S)$ is effectively the semi-Thue system $(N \cup \Sigma, P)$, rewriting strings in exactly the same way; the only difference is in that we distinguish specific nonterminal symbols which must be rewritten in rewrite rules, and are only interested in rewritings from the designated start symbol $S$ to strings without nonterminal symbols.
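The four components of a grammar and the one-step derivation relation can be mirrored in a small Python data structure. This is a minimal sketch under stated assumptions: every symbol is a single character, strings are ordinary Python strings, and the names Grammar, derive_one_step, is_sentence, and the toy grammar (with rules S → aSb and S → ab) are hypothetical and chosen only for illustration.

```python
from typing import NamedTuple


class Grammar(NamedTuple):
    nonterminals: frozenset  # the set N
    terminals: frozenset     # the set Sigma, disjoint from N
    productions: tuple       # the set P, as (left, right) pairs of strings
    start: str               # the start symbol S, a member of N


def derive_one_step(grammar, x):
    """Return every y such that x derives y in one step.

    For each production p -> q and each occurrence of p in x (x = u p v),
    the string u q v is one possible result.
    """
    results = set()
    for p, q in grammar.productions:
        i = x.find(p)
        while i != -1:
            results.add(x[:i] + q + x[i + len(p):])
            i = x.find(p, i + 1)
    return results


def is_sentence(grammar, w):
    """True if w contains terminal symbols only."""
    return all(symbol in grammar.terminals for symbol in w)


# Hypothetical toy grammar used only to exercise the functions.
toy = Grammar(frozenset("S"), frozenset("ab"), (("S", "aSb"), ("S", "ab")), "S")

print(sorted(derive_one_step(toy, "aSb")))  # ['aaSbb', 'aabb']
print(is_sentence(toy, "aabb"))             # True
```

Representing productions as pairs of strings keeps the sketch close to the definition; the language of the grammar would then be the set of sentences reachable from the start symbol by repeated calls to derive_one_step.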
Example
For these examples, formal languages are specified using set-builder notation.
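Consider the grammar $G$ where $N = \{S, B\}$, $\Sigma = \{a, b, c\}$, $S$ is the start symbol, and $P$ consists of the following production rules:
1. $S \rightarrow aBSc$
2. $S \rightarrow abc$
3. $Ba \rightarrow aB$
4. $Bb \rightarrow bb$
Some examples of the derivation of strings in $L(G)$ are:
$S \Rightarrow_2 abc$
$S \Rightarrow_1 aBSc \Rightarrow_2 aBabcc \Rightarrow_3 aaBbcc \Rightarrow_4 aabbcc$
$S \Rightarrow_1 aBSc \Rightarrow_1 aBaBScc \Rightarrow_2 aBaBabccc \Rightarrow_3 aaBBabccc \Rightarrow_3 aaBaBbccc \Rightarrow_3 aaaBBbccc \Rightarrow_4 aaaBbbccc \Rightarrow_4 aaabbbccc$
(Note on notation: $L \Rightarrow_i R$ reads "L generates R by means of production i".)
This grammar defines the language $L = \{ a^n b^n c^n \mid n \geq 1 \}$ where $a^n$ denotes a string of $n$ consecutive $a$'s.
Thus, the language is the set of strings that consist of 1 or more $a$'s, followed by the same number of $b$'s, followed by the same number of $c$'s.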
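As a rough check of this example, the following Python sketch enumerates the short strings of the language by breadth-first search over sentential forms. It assumes the four productions are S → aBSc, S → abc, Ba → aB and Bb → bb, with upper-case letters as nonterminals; the function name `generate` and the length bound are illustrative choices. Because no rule shortens a string, discarding forms longer than the bound loses nothing.

```python
from collections import deque

# Assumed productions of the example grammar (upper case = nonterminal).
RULES = [("S", "aBSc"), ("S", "abc"), ("Ba", "aB"), ("Bb", "bb")]


def generate(max_len: int) -> set:
    """Collect all terminal strings of length <= max_len derivable from S."""
    seen = {"S"}
    queue = deque(["S"])
    language = set()
    while queue:
        form = queue.popleft()
        if form.islower():  # no nonterminal symbols remain
            language.add(form)
            continue
        for lhs, rhs in RULES:
            i = form.find(lhs)
            while i != -1:
                nxt = form[:i] + rhs + form[i + len(lhs):]
                if len(nxt) <= max_len and nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
                i = form.find(lhs, i + 1)
    return language


print(sorted(generate(9), key=len))  # ['abc', 'aabbcc', 'aaabbbccc']
```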
The Chomsky hierarchy
When Noam Chomsky first formalized generative grammars in 1956, he classified them into types now known as the Chomsky hierarchy.
The difference between these types is that they have increasingly strict production rules and can express fewer formal languages.
Two important types are context-free grammars (Type 2) and regular grammars (Type 3).
The languages that can be described with such a grammar are called context-free languages and regular languages, respectively.
Although much less powerful than unrestricted grammars (Type 0), which can in fact express any language that can be accepted by a Turing machine, these two restricted types of grammars are most often used because parsers for them can be efficiently implemented.
For example, all regular languages can be recognized by a finite state machine, and for useful subsets of context-free grammars there are well-known algorithms to generate efficient LL parsers and LR parsers to recognize the corresponding languages those grammars generate.
Context-free grammars
A context-free grammar is a grammar in which the left-hand side of each production rule consists of only a single nonterminal symbol.
This restriction is non-trivial; not all languages can be generated by context-free grammars.
Those that can are called context-free languages.
The language defined above is not a context-free language, and this can be rigorously proven using the pumping lemma for context-free languages, but for example the language $\{ a^n b^n \mid n \geq 1 \}$ (at least 1 $a$ followed by the same number of $b$'s) is context-free, as it can be defined by the grammar $G_2$ with $N = \{S\}$, $\Sigma = \{a, b\}$, $S$ the start symbol, and the following production rules:
1. $S \rightarrow aSb$
2. $S \rightarrow ab$
A context-free language can be recognized in $O(n^3)$ time (see Big O notation) by an algorithm such as Earley's algorithm.
That is, for every context-free language, a machine can be built that takes a string as input and determines in $O(n^3)$ time whether the string is a member of the language, where $n$ is the length of the string.
Further, some important subsets of the context-free languages can be recognized in linear time using other algorithms.
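The cubic bound can be illustrated with a short Python sketch. Note that it uses the CYK algorithm rather than Earley's algorithm, because CYK is shorter to write down; it needs the grammar in Chomsky normal form. The normal-form grammar below for the language of $n$ $a$'s followed by $n$ $b$'s, and the names `UNARY`, `BINARY` and `cyk_accepts`, are assumptions made for this example.

```python
from itertools import product

# Hypothetical Chomsky-normal-form grammar for { a^n b^n | n >= 1 }:
#   S -> A B | A T,   T -> S B,   A -> a,   B -> b
UNARY = {"a": {"A"}, "b": {"B"}}
BINARY = {("A", "B"): {"S"}, ("A", "T"): {"S"}, ("S", "B"): {"T"}}


def cyk_accepts(word: str, start: str = "S") -> bool:
    n = len(word)
    if n == 0:
        return False
    # table[i][l - 1] holds the nonterminals that derive word[i : i + l].
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(word):
        table[i][0] = set(UNARY.get(ch, ()))
    # Three nested loops over positions give the O(n^3) behaviour.
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            for split in range(1, length):
                left = table[i][split - 1]
                right = table[i + split][length - split - 1]
                for pair in product(left, right):
                    table[i][length - 1] |= BINARY.get(pair, set())
    return start in table[0][n - 1]


print(cyk_accepts("aaabbb"), cyk_accepts("aab"))  # True False
```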
Regular grammars
In regular grammars, the left-hand side is again only a single nonterminal symbol, but now the right-hand side is also restricted: it may be the empty string, or a single terminal symbol, or a single terminal symbol followed by a nonterminal symbol, but nothing else.
(Sometimes a broader definition is used: one can allow longer strings of terminals or single nonterminals without anything else, making languages easier to denote while still defining the same class of languages.)
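The language defined above is not regular, but the language $\{ a^n b^m \mid m, n \geq 1 \}$ (at least 1 $a$ followed by at least 1 $b$, where the numbers may be different) is, as it can be defined by the grammar $G_3$ with $N = \{S, A, B\}$, $\Sigma = \{a, b\}$, $S$ the start symbol, and the following production rules:
1. $S \rightarrow aA$
2. $A \rightarrow aA$
3. $A \rightarrow bB$
4. $B \rightarrow bB$
5. $B \rightarrow \epsilon$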
All languages generated by a regular grammar can be recognized in linear time by a finite state machine.
Although, in practice, regular grammars are commonly expressed using regular expressions, some forms of regular expression used in practice do not strictly generate the regular languages and do not show linear recognitional performance due to those deviations.
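A minimal Python sketch of linear-time recognition by a finite state machine follows, for the language of one or more $a$'s followed by one or more $b$'s. The state names `S`, `A` and `B` and the transition table are assumptions chosen to mirror a regular grammar for that language; each input character is examined exactly once.

```python
# Transitions: S --a--> A, A --a--> A, A --b--> B, B --b--> B.
# B is the only accepting state; any missing transition rejects the input.
TRANSITIONS = {
    ("S", "a"): "A",
    ("A", "a"): "A",
    ("A", "b"): "B",
    ("B", "b"): "B",
}
ACCEPTING = {"B"}


def dfa_accepts(word: str) -> bool:
    state = "S"
    for ch in word:  # one table lookup per character: linear time
        state = TRANSITIONS.get((state, ch))
        if state is None:
            return False
    return state in ACCEPTING


print(dfa_accepts("aaabb"), dfa_accepts("aba"))  # True False
```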
Other forms of generative grammars
Many extensions and variations on Chomsky's original hierarchy of formal grammars have been developed more recently, both by linguists and by computer scientists, usually either in order to increase their expressive power or in order to make them easier to analyze or parse.
Some forms of grammars developed include:
Tree-adjoining grammars increase the expressiveness of conventional generative grammars by allowing rewrite rules to operate on parse trees instead of just strings.
Affix grammars and attribute grammars allow rewrite rules to be augmented with semantic attributes and operations, useful both for increasing grammar expressiveness and for constructing practical language translation tools.
Analytic grammars
Though there is a large body of literature on parsing algorithms, most of these algorithms assume that the language to be parsed is initially described by means of a generative formal grammar, and that the goal is to transform this generative grammar into a working parser.
Strictly speaking, a generative grammar does not in any way correspond to the algorithm used to parse a language, and various algorithms have different restrictions on the form of production rules that are considered well-formed.
An alternative approach is to formalize the language in terms of an analytic grammar in the first place, which more directly corresponds to the structure and semantics of a parser for the language.
Examples of analytic grammar formalisms include the following:
The Language Machine directly implements unrestricted analytic grammars.
Substitution rules are used to transform an input to produce outputs and behaviour.
The system can also produce the lm-diagram which shows what happens when the rules of an unrestricted analytic grammar are being applied.
Top-down parsing language (TDPL): a highly minimalist analytic grammar formalism developed in the early 1970s to study the behavior of top-down parsers.
Link grammars: a form of analytic grammar designed for linguistics, which derives syntactic structure by examining the positional relationships between pairs of words.
Parsing expression grammars (PEGs): a more recent generalization of TDPL designed around the practical expressiveness needs of programming language and compiler writers.
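To show how an analytic grammar corresponds directly to a parser, here is a small Python sketch of a recognizer written in the style of a parsing expression grammar. The rule S ← 'a' S 'b' / 'a' 'b' and the function names `parse_s` and `matches` are assumptions for this example; the key point is that the choice operator is ordered, so the second alternative is tried only after the first has failed.

```python
from typing import Optional


def parse_s(text: str, pos: int = 0) -> Optional[int]:
    """Apply the rule  S <- 'a' S 'b' / 'a' 'b'  at position pos.

    Returns the position just after the match, or None on failure.
    """
    # First alternative: 'a' S 'b'
    if text.startswith("a", pos):
        inner = parse_s(text, pos + 1)
        if inner is not None and text.startswith("b", inner):
            return inner + 1
    # Ordered choice: fall back to the second alternative 'a' 'b'.
    if text.startswith("ab", pos):
        return pos + 2
    return None


def matches(text: str) -> bool:
    # The whole input must be consumed by a single application of S.
    return parse_s(text) == len(text)


print(matches("aabb"), matches("aab"))  # True False
```

Each grammar rule becomes one function, which is why such grammars describe the structure of a top-down parser rather than a process for generating strings.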
Free software
Free software or software libre is software that can be used, studied, and modified without restriction, and which can be copied and redistributed in modified or unmodified form either without restriction, or with minimal restrictions only to ensure that further recipients can also do these things.
In practice, for software to be distributed as free software, the human readable form of the program (the "source code") must be made available to the recipient along with a notice granting the above permissions.
Such a notice is a "free software licence", or, in theory, could be a notice saying that the source code is released into the public domain.
The free software movement was conceived in 1983 by Richard Stallman to make these freedoms available to every computer user.
From the late 1990s onward, alternative terms for free software came into use.
"Open source software" is the most common such alternative term.
Others include "software libre" and "free and open-source software" ("FOSS", or, with "libre" added, "FLOSS").
The antonym of free software is "proprietary software" or non-free software.
Free software is distinct from "freeware" which is proprietary software made available free of charge.
Users usually cannot study, modify, or redistribute freeware.
Since free software may be freely redistributed, it generally is available at little or no cost.
Free software business models are usually based on adding value such as support, training, customization, integration, or certification.
At the same time, some business models which work with proprietary software are not compatible with free software, such as those that depend on a user paying for a licence in order to lawfully use a software product.
History
In the 1950s, 1960s, and 1970s, it was normal for computer users to have the freedoms that are provided by free software.
Software was commonly shared by individuals who used computers and by hardware manufacturers who were glad that people were making software that made their hardware useful.
In the 1970s and early 1980s, the software industry began using technical measures (such as only distributing binary copies of computer programs) to prevent computer users from being able to study and modify software.
In 1980 copyright law was extended to computer programs.
In 1983, Richard Stallman, longtime member of the hacker community at the MIT Artificial Intelligence Laboratory, announced the GNU project, saying that he had become frustrated with the effects of the change in culture of the computer industry and its users.
Software development for the GNU operating system began in January 1984, and the Free Software Foundation (FSF) was founded in October 1985.
Stallman developed a free software definition and the concept of "copyleft", designed to ensure software freedom for all.
Free software is a widespread international concept, producing software used by individuals, large organizations, and governmental administrations.
Free software has a very high market penetration in server-side Internet applications such as the Apache web server, MySQL database, and PHP scripting language.
Completely free computing environments are available as large packages of basic system software, such as the many GNU/Linux distributions and FreeBSD.
Free software developers have also created free versions of almost all commonly used desktop applications, including Web browsers, office productivity suites, and multimedia players.
It is important to note, however, that in many categories, free software for individual workstations or home users has only a fraction of the market share of its proprietary competitors.
Most free software is distributed online without charge, or off-line at the marginal cost of distribution, but this pricing model is not required, and people may sell copies of free software programs for any price.
The economic viability of free software has been recognised by large corporations such as IBM, Red Hat, and Sun Microsystems.
Many companies whose core business is not in the IT sector choose free software for their Internet information and sales sites, due to the lower initial capital investment and ability to freely customize the application packages.
Also, some non-software industries are beginning to use techniques similar to those used in free software development for their research and development process; scientists, for example, are looking towards more open development processes, and hardware such as microchips are beginning to be developed with specifications released under copyleft licenses (see the OpenCores project, for instance).
Creative Commons and the free culture movement have also been largely influenced by the free software movement.
Naming
The FSF recommends using the term "free software" rather than "open source software" because the latter term and its associated marketing campaign focus on the technical issues of software development, avoiding the issue of user freedoms.
"Libre" is used to avoid the ambiguity of the word "free".
However, amongst English speakers, "libre" is used primarily within the free software movement.
Definition
The first formal definition of free software was published by FSF in February 1986.
That definition, written by Richard Stallman, is still maintained today and states that software is free software if people who receive a copy of the software have the following four freedoms:
Freedom 0: The freedom to run the program for any purpose.
Freedom 1: The freedom to study and modify the program.
Freedom 2: The freedom to copy the program so you can help your neighbor.
Freedom 3: The freedom to improve the program, and release your improvements to the public, so that the whole community benefits.
Freedoms 1 and 3 require source code to be available because studying and modifying software without its source code is highly impractical.
Thus, free software means that computer users have the freedom to cooperate with whom they choose, and to control the software they use.
To summarize this into a remark distinguishing libre (freedom) software from gratis (zero price) software, Richard Stallman said: "Free software is a matter of liberty, not price.
To understand the concept, you should think of 'free' as in 'free speech', not as in 'free beer'".
In the late 1990s, other groups published their own definitions which describe an almost identical set of software.
The most notable are the Debian Free Software Guidelines, published in 1997, and the Open Source Definition, published in 1998.
The BSD-based operating systems, such as FreeBSD, OpenBSD, and NetBSD, do not have their own formal definitions of free software.
Users of these systems generally find the same set of software to be acceptable, but sometimes see copyleft as restrictive.
They generally advocate permissive free software licenses, which allow others to make software based on their source code, and then release the modified result as proprietary software.
Their view is that this permissive approach is more free.
The Kerberos, X.org, and Apache software licenses are substantially similar in intent and implementation.
All of these software packages originated in academic institutions interested in wide technology transfer (University of California, MIT, and UIUC).
Examples of free software
The Free Software Directory is a free software project that maintains a large database of free software packages.
Notable free software
GUI related
X Window System
GNOME
KDE
Xfce desktop environments
OpenOffice.org office suite
Mozilla and Firefox web browsers.
Typesetting and document preparation systems
TeX
LaTeX
Graphics tools like GIMP image graphics editor and Blender 3D animation program.
Text editors like vi or emacs.
Ogg is a free multimedia container format, used to hold Vorbis audio and Theora video.
Relational database systems
MySQL
PostgreSQL
GCC compilers, GDB debugger and the GNU C Library.
Programming languages
Java
Perl
PHP
Python
Lua
Ruby
Tcl
Servers
Apache web server
BIND name server
Sendmail mail transport
Samba file server.
Operating systems
GNU/Linux
BSD
Darwin
OpenSolaris
Free software licenses
All free software licenses must grant people all the freedoms discussed above.
However, unless the applications' licenses are compatible, combining programs by mixing source code or directly linking binaries is problematic, because of license technicalities.
Programs indirectly connected together may avoid this problem.
The majority of free software uses a small set of licenses.
The most popular of these licenses are:
the GNU General Public License
the GNU Lesser General Public License
the BSD License
the Mozilla Public License
the MIT License
the Apache License
The Free Software Foundation and the Open Source Initiative both publish lists of licenses that they find to comply with their own definitions of free software and open-source software respectively.
List of FSF approved software licenses
List of OSI approved software licenses
These lists are necessarily incomplete, because a license need not be known by either organization in order to provide these freedoms.
Apart from these two organizations, the Debian project is seen by some to provide useful advice on whether particular licenses comply with its Debian Free Software Guidelines.
Debian does not publish a list of approved licenses, so its judgments have to be tracked by checking what software it has allowed into its software archives.
That is summarized at the Debian web site.
However, it is rare for a license to be judged compliant by one of the FSF and OSI but not by the other (the Netscape Public License used for early versions of Mozilla being an exception), so the exact definitions of the terms have not become hot issues.
Permissive and copyleft licenses
The FSF categorizes licenses in the following ways:
Public domain software: the copyright has expired, the work was not copyrighted, or the author has abandoned the copyright.
Since public-domain software lacks copyright protection, it may be freely incorporated into any work, whether proprietary or free.
Permissive licenses, also called BSD-style because they are applied to much of the software distributed with the BSD operating systems.
The author retains copyright solely to disclaim warranty and require proper attribution of modified works, but permits redistribution and modification in any work, even proprietary ones.
Copyleft licenses, the GNU General Public License being the most prominent.
The author retains copyright and permits redistribution and modification provided all such redistribution is licensed under the same license.
Additions and modifications by others must also be licensed under the same 'copyleft' license whenever they are distributed with part of the original licensed product.
Security and reliability
There is debate over the security of free software in comparison to proprietary software, with a major issue being security through obscurity.
A popular quantitative test in computer security is to count the known unpatched security flaws in each product and compare the totals.
Generally, users of this method advise avoiding products which lack fixes for known security flaws, at least until a fix is available.
Some claim that this method is biased by counting more vulnerabilities for the free software, since its source code is accessible and its community is more forthcoming about what problems exist.
Free software advocates counter that even if proprietary software does not have "published" flaws, flaws could still exist and possibly be known to malicious users.
The ability of users to view and modify the source code allows many more people to potentially analyse the code and possibly to have a higher rate of finding bugs and flaws than an average sized corporation could manage.
Users having access to the source code also makes creating and deploying spyware far more difficult.
David A. Wheeler has published research concluding that free software is quantitatively more reliable than proprietary software.
Adoption
Free software played a part in the development of the Internet, the World Wide Web and the infrastructure of dot-com companies.
Free software allows users to cooperate in enhancing and refining the programs they use; free software is a pure public good rather than a private good.
Companies that contribute to free software can increase commercial innovation amidst the void of patent cross licensing lawsuits.
(See mpeg2 patent holders)
Under the free software business model, free software vendors may charge a fee for distribution and offer pay support and software customization services.
Proprietary software uses a different business model, where a customer of the proprietary software pays a fee for a license to use the software.
This license may grant the customer the ability to configure some or no parts of the software themselves.
Often some level of support is included in the purchase of proprietary software, but additional support services (especially for enterprise applications) are usually available for an additional fee.
Some proprietary software vendors will also customize software for a fee.
Free software is generally available at little to no cost and can result in permanently lower costs compared to proprietary software.
With free software, businesses can fit software to their specific needs by changing the software themselves or by hiring programmers to modify it for them.
Free software often has no warranty, and more importantly, generally does not assign legal liability to anyone.
However, any two parties are permitted to agree on a warranty covering the condition of the software and its usage.
Such an agreement is made separately from the free software license.
Controversies
Binary blobs
In 2006, OpenBSD started the first campaign against the use of binary blobs in kernels.
Blobs are usually freely distributable device drivers for hardware from vendors that do not reveal driver source code to users or developers.
This restricts the users' freedom to effectively modify the software and distribute modified versions.
Also, since the blobs are undocumented and may have bugs, they pose a security risk to any operating system whose kernel includes them.
The proclaimed aim of the campaign against blobs is to collect hardware documentation that allows developers to write free software drivers for that hardware, ultimately enabling all free operating systems to become or remain blob-free.
The issue of binary blobs in the Linux kernel and other device drivers motivated some developers in Ireland to launch gNewSense, a GNU/Linux distribution with all the binary blobs removed.
The project received support from the Free Software Foundation.
BitKeeper
Larry McVoy invited high-profile free software projects to use his proprietary versioning system, BitKeeper, free of charge, in order to attract paying users.
In 2002, Linux coordinator Linus Torvalds decided to use BitKeeper to develop the Linux kernel, a free software project, claiming no free software alternative met his needs.
This controversial decision drew criticism from several sources, including the Free Software Foundation's founder Richard Stallman.
Following the apparent reverse engineering of BitKeeper's protocols, McVoy withdrew permission for gratis use by free software projects, leading the Linux kernel community to develop a free software replacement in Git.
Patent deals
In November 2006, the Microsoft and Novell software corporations announced a controversial partnership involving, among other things, patent protection for some customers of Novell under certain conditions.
Freeware
Freeware is computer software that is available for use at no cost or for an optional fee.
Freeware is often made available in a binary-only, proprietary form, thus making it distinct from free software.
Proprietary freeware allows authors to contribute something for the benefit of the community, while at the same time allowing them to retain control of the source code and preserve its business potential.
Freeware is different from shareware, where the user is obliged to pay (e.g. after some trial period or for additional functionality).
History
The term freeware was coined by Andrew Fluegelman when he wanted to sell a communications program named PC-Talk that he had created but for which he did not wish to use traditional methods of distribution because of their cost.
Fluegelman actually distributed PC-Talk via a process now referred to as shareware.
Current use of the term freeware does not necessarily match the original concept by Andrew Fluegelman.
Criteria
The only criterion for being classified as freeware is that the software must be fully functional for an unlimited time with no monetary cost.
The software license may impose one or more other restrictions on the type of use including personal use, individual use, non-profit use, non-commercial use, academic use, commercial use or any combination of these.
For instance, the license may be "free for personal, non-commercial use."
Everything created with freeware programs can be distributed at no cost (for example graphics, documents, or sounds made by the user).
French language
French (français) is today spoken around the world by 72 to 130 million people as a native language, and by about 190 to 600 million people as a second or third language, with significant numbers of speakers in 54 countries.
Most native speakers of the language live in France, where the language originated.
The rest live in Canada, Belgium and Switzerland.
French is a descendant of the Latin language of the Roman Empire, as are languages such as Portuguese, Spanish, Italian, Catalan and Romanian.
Its development was also influenced by the native Celtic languages of Roman Gaul and by the Germanic language of the post-Roman Frankish invaders.
It is an official language in 29 countries, most of which form what is called in French La Francophonie, the community of French-speaking nations.
It is an official language of all United Nations agencies and a large number of international organizations.
According to the European Union, 129 million (26% of the 497,198,740) people in 27 member states speak French, of which 59 million (12%) speak it natively and 69 million (14%) claim to speak it as a second language, which makes it the third most spoken second language in the Union, after English and German respectively.
Geographic distribution
Europe
Legal status in France
Per the Constitution of France, French has been the official language since 1992 (although previous legal texts have made it official since 1539, see ordinance of Villers-Cotterêts).
France mandates the use of French in official government publications, public education outside of specific cases (though these dispositions are often ignored) and legal contracts; advertisements must bear a translation of foreign words.
In addition to French, there are also a variety of regional languages.
France has signed the European Charter for Regional Languages but has not ratified it since that would go against the 1958 Constitution.
Switzerland
French is one of the four official languages of Switzerland (along with German, Italian, and Romansh) and is spoken in the part of Switzerland called Romandie.
French is the native language of about 20% of the Swiss population.
Belgium
In Belgium, French is the official language of Wallonia (excluding the East Cantons, which are German-speaking) and one of the two official languages—along with Dutch—of the Brussels-Capital Region where it is spoken by the majority of the population, though often not as their primary language.
French and German are neither official languages nor recognised minority languages in the Flemish Region, although along the borders with the Walloon and Brussels-Capital regions there are a dozen municipalities with language facilities for French-speakers; a mirroring situation exists for the Walloon Region with respect to the Dutch and German languages.
In total, native French-speakers make up about 40% of the country's population; the remaining 60% speak Dutch, and 59% of those claim to speak French as a second language.
French is thus known by an estimated 75% of all Belgians, either as a mother tongue, as second, or as third language.
Monaco and Andorra
Although Monégasque is the national language of the Principality of Monaco, French is the only official language, and French nationals make up some 47% of the population.
Catalan is the only official language of Andorra; however, French is commonly used due to the proximity to France.
French nationals make up 7% of the population.
Italy
French is also an official language, along with Italian, in the province of Aosta Valley, Italy.
In addition, a number of Franco-Provençal dialects are spoken in the province, although they do not have official recognition.
Luxembourg
French is one of three official languages of the Grand Duchy of Luxembourg; the other two are German and Luxembourgish.
Luxembourgish is the natively spoken language of Luxembourg.
Luxembourg's education system is trilingual: the first years of primary school are taught in Luxembourgish before changing to German, while in secondary school the language of instruction changes to French.
The Channel Islands
Although Jersey and Guernsey, the two bailiwicks collectively referred to as the Channel Islands, are separate entities, both use French to some degree, mostly in an administrative capacity.
Jersey Legal French is the standardized variety used in Jersey.
The Americas
Legal status in Canada
About 7 million Canadians are native French-speakers, of whom 6 million live in Quebec, and French is one of Canada's two official languages (the other being English).
Various provisions of the Canadian Charter of Rights and Freedoms deal with Canadians' right to access services in both languages, including the right to a publicly funded education in the minority language of each province, where numbers warrant in a given locality.
By law, the federal government must operate and provide services in both English and French, proceedings of the Parliament of Canada must be translated into both these languages, and most products sold in Canada must have labeling in both languages.
Overall, about 13% of Canadians have knowledge of French only, while 18% have knowledge of both English and French.
In contrast, over 82% of the population of Quebec speaks French natively, and almost 96% speak it as either their first or second language.
It has been the sole official language of Quebec since 1974.
The legal status of French was further strengthened with the 1977 adoption of the Charter of the French Language (popularly known as Bill 101), which guarantees that every person has a right to have the civil administration, the health and social services, corporations, and enterprises in Quebec communicate with him in French.
While the Charter mandates that certain provincial government services, such as those relating to health and education, be offered to the English minority in its language, where numbers warrant, its primary purpose is to cement the role of French as the primary language used in the public sphere.
The provision of the Charter that has arguably had the most significant impact mandates French-language education unless a child's parents or siblings have received the majority of their own primary education in English within Canada, with minor exceptions.
This measure has reversed a historical trend whereby a large number of immigrant children would attend English schools.
In so doing, the Charter has greatly contributed to the "visage français" (French face) of Montreal in spite of its growing immigrant population.
Other provisions of the Charter have been ruled unconstitutional over the years, including those mandating French-only commercial signs, court proceedings, and debates in the legislature.
Though none of these provisions are still in effect today, some continued to be on the books for a time even after courts had ruled them unconstitutional as a result of the government's decision to invoke the so-called notwithstanding clause of the Canadian constitution to override constitutional requirements.
In 1993, the Charter was rewritten to allow signage in other languages so long as French was markedly "predominant."
Another section of the Charter guarantees every person the right to work in French, meaning the right to have all communications with one's superiors and coworkers in French, as well as the right not to be required to know another language as a condition of hiring, unless this is warranted by the nature of one's duties, such as by reason of extensive interaction with people located outside the province or similar reasons.
This section has not been as effective as had originally been hoped, and has faded somewhat from public consciousness.
As of 2006, approximately 65% of the workforce on the island of Montreal predominantly used French in the workplace.
The only other province that recognizes French as an official language is New Brunswick, which is officially bilingual, like the nation as a whole.
Outside of Quebec, the highest number of Francophones in Canada, 485,000, excluding those who claim multiple mother tongues, reside in Ontario, whereas New Brunswick, home to the vast majority of Acadians, has the highest percentage of Francophones after Quebec, 33%, or 237,000.
In Ontario, Nova Scotia, Prince Edward Island, and Manitoba, French does not have full official status, although the provincial governments do provide some French-language services in all communities where significant numbers of Francophones live.
Canada's three northern territories (Yukon, Northwest Territories, and Nunavut) all recognize French as an official language as well.
All provinces make some effort to accommodate the needs of their Francophone citizens, although the level and quality of French-language service vary significantly from province to province.
The Ontario French Language Services Act, adopted in 1986, guarantees French language services in that province in regions where the Francophone population exceeds 10% of the total population, as well as communities with Francophone populations exceeding 5,000, and certain other designated areas; this has the most effect in the north and east of the province, as well as in other larger centres such as Ottawa, Toronto, Hamilton, Mississauga, London, Kitchener, St. Catharines, Greater Sudbury and Windsor.
However, the French Language Services Act does not confer the status of "official bilingualism" on these cities, as that designation carries with it implications which go beyond the provision of services in both languages.
The City of Ottawa's language policy (by-law 2001-170) allows employees to work, and to be supervised, in the official language of their choice.
Canada has the status of member state in the Francophonie, while the provinces of Quebec and New Brunswick are recognized as participating governments.
Ontario is currently seeking to become a full member on its own.
Haiti
French is an official language of Haiti, although it is mostly spoken by the upper class, while Haitian Creole (a French-based creole language) is more widely spoken as a mother tongue.
French overseas territories
French is also the official language in France's overseas territories of French Guiana, Guadeloupe, Martinique, Saint Barthélemy, St. Martin and Saint-Pierre and Miquelon.
The United States
Although it has no official recognition on a federal level, French is the third most-spoken language in the United States, after English and Spanish, and the second most-spoken in the states of Louisiana, Maine, Vermont and New Hampshire.
Louisiana is home to two distinct dialects, Cajun French and Creole French.
Africa
A majority of the world's French-speaking population lives in Africa.
According to the 2007 report by the Organisation internationale de la Francophonie, an estimated 115 million African people spread across 31 francophone African countries can speak French either as a first or second language.
French is mostly a second language in Africa, but in some areas it has become a first language, such as in the region of Abidjan, Côte d'Ivoire and in Libreville, Gabon.
One cannot speak of a single form of African French; rather, there are diverse forms of African French, which have developed through contact with many indigenous African languages.
In the territories of the Indian Ocean, the French language is often spoken alongside French-derived creole languages, the major exception being Madagascar.
There, a Malayo-Polynesian language (Malagasy) is spoken alongside French.
French has also faced competition from English, which has long been the official language in Mauritius and the Seychelles and has recently become an official language of Madagascar.
Sub-Saharan Africa is the region where the French language is most likely to expand, owing to the expansion of education, and it is also where the language has evolved most in recent years.
Some vernacular forms of French in Africa can be difficult to understand for French speakers from other countries but written forms of the language are very closely related to those of the rest of the French-speaking world.
French is an official language of many African countries, most of them former French or Belgian colonies:
Benin
Burkina Faso
Burundi
Cameroon
Central African Republic
Chad
Comoros
Congo (Brazzaville)
Côte d'Ivoire
Democratic Republic of the Congo
Djibouti
Equatorial Guinea (former colony of Spain)
Gabon
Guinea
Madagascar
Mali
Niger
Rwanda
Senegal
Seychelles
Togo
In addition, French is an administrative language and is commonly used, though not on an official basis, in Mauritius and in the Maghreb states:
Mauritania
Algeria
Morocco
Tunisia
Various reforms have been implemented in recent decades in Algeria to improve the status of Arabic relative to French, especially in education.
While the predominant European language in Egypt is English, French is considered to be a more sophisticated language by some elements of the Egyptian upper and upper-middle classes; for this reason, a typical educated Egyptian will learn French in addition to English at some point in his or her education.
The perception of sophistication may be related to the use of French as the royal court language of Egypt during the nineteenth century.
Egypt participates in La Francophonie.
French is also the official language of Mayotte and Réunion, two overseas territories of France located in the Indian Ocean, as well as an administrative and educational language in Mauritius, along with English.
Asia
Lebanon
French was an official language of Lebanon, along with Arabic, until 1941, when the country declared its independence from France.
Many Lebanese still regard French as an official language, as it is widely used, especially for administrative purposes, and is taught in schools as a primary language along with Arabic.
Southeast Asia
French is an administrative language in Laos and Cambodia.
French was historically spoken by the elite in the leased territory Guangzhouwan in southern China.
In colonial Vietnam, the elites spoke French and many who worked for the French spoke a French creole known as "Tây Bồi" (now extinct).
India
French has official status in the Indian Union Territory of Pondicherry, along with the regional language Tamil, and some students in Tamil Nadu may opt for French as their third or fourth language (usually behind English, Tamil and Hindi).
French is also commonly taught as a third language in secondary school in most cities of Maharashtra State, including Mumbai, as part of the Secondary (X-SSC) and Higher Secondary School (XII-HSC) certificate examinations.
Oceania
French is also a second official language of the Pacific Island nation of Vanuatu, along with France's territories of French Polynesia, Wallis & Futuna and New Caledonia.
Dialects
Acadian French
African French
Aostan French
Belgian French
Cajun French
Canadian French
Cambodian French
Guyana French (see French Guiana)
Indian French
Jersey Legal French
Lao French
Levantine French (most commonly referred to as Lebanese French, very similar to Maghreb French)
Louisiana Creole French
Maghreb French (see also North African French)
Meridional French
Metropolitan French
New Caledonian French
Newfoundland French
Oceanic French
Quebec French
South East Asian French
Swiss French
Vietnamese French
West Indian French
History
Sounds
Although there are many French regional accents, only one version of the language is normally chosen as a model for foreign learners, which has no commonly used special name, but has been termed français neutre (neutral French).
Voiced stops (i.e. /b, d, ɡ/) are typically produced fully voiced throughout.
Voiceless stops (i.e. /p, t, k/) are unaspirated.
Nasals: The velar nasal /ŋ/ occurs only in final position in borrowed (usually English) words: parking, camping, swing.
The palatal nasal /ɲ/ can occur in word-initial position (e.g. gnon), but it is most frequently found in intervocalic, onset position or word-finally (e.g. montagne).
Fricatives: French has three pairs of homorganic fricatives distinguished by voicing, i.e. labiodental /f/–/v/, dental /s/–/z/, and palato-alveolar /ʃ/–/ʒ/.
Notice that /s/ and /z/ are dental, like the plosives /t/ and /d/, and the nasal /n/.
French has one rhotic whose pronunciation varies considerably among speakers and phonetic contexts.
In general it is described as a voiced uvular fricative /ʁ/, as in roue "wheel" [ʁu].
Vowels are often lengthened before this segment.
It can be reduced to an approximant, particularly in final position (e.g. "fort") or reduced to zero in some word-final positions.
For other speakers, a uvular trill is also fairly common, and an apical trill occurs in some dialects.
Lateral and central approximants: The lateral approximant /l/ is unvelarised in both onset (lire) and coda position (il).
In the onset, the central approximants [w], [ɥ], and [j] each correspond to a high vowel, /u/, /y/, and /i/ respectively.
There are a few minimal pairs where the approximant and corresponding vowel contrast, but there are also many cases where they are in free variation.
Contrasts between /j/ and /i/ occur in final position as in paye "pay" vs. pays "country".
French pronunciation follows strict rules based on spelling, but French spelling is often based more on history than phonology.
The rules for pronunciation vary between dialects, but the standard rules are:
final consonants: Final single consonants, in particular s, x, z, t, d, n and m, are normally silent.
(The final letters c, r, f and l, however, are normally pronounced.)
When the following word begins with a vowel, though, a silent consonant may once again be pronounced, to provide a liaison or "link" between the two words.
Some liaisons are mandatory, for example the s in les amants or vous avez; some are optional, depending on dialect and register, for example the first s in deux cents euros or euros irlandais; and some are forbidden, for example the s in beaucoup d'hommes aiment.
The t of et is never pronounced and the silent final consonant of a noun is only pronounced in the plural and in set phrases like pied-à-terre.
Note that in the case of a word ending in d, as in pied-à-terre, the consonant t is pronounced instead.
Doubling a final n and adding a silent e at the end of a word (e.g. chien → chienne) makes it clearly pronounced.
Doubling a final l and adding a silent e (e.g. gentil → gentille) adds a [j] sound.
elision or vowel dropping: Some monosyllabic function words ending in a or e, such as je and que, drop their final vowel when placed before a word that begins with a vowel sound (thus avoiding a hiatus).
The missing vowel is replaced by an apostrophe (e.g. je ai is instead pronounced and spelt j'ai).
This gives for example the same pronunciation for l'homme qu'il a vu ("the man whom he saw") and l'homme qui l'a vu ("the man who saw him").
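The elision rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `elide`, the word list `ELIDING_WORDS`, and the one-entry `MUTE_H_WORDS` set are invented for the example, and the sketch does not try to cover aspirated h, the letter y, or other special cases.

```python
# Hypothetical, non-exhaustive word lists used only for this illustration.
ELIDING_WORDS = {"je", "me", "te", "se", "le", "la", "ne", "de", "que"}
VOWELS = set("aeiouàâéèêëîïôûœ")
MUTE_H_WORDS = {"homme"}  # words whose initial h behaves like a vowel


def starts_with_vowel_sound(word: str) -> bool:
    """Rough test: vowel letter first, or a listed mute-h word."""
    w = word.lower()
    return bool(w) and (w[0] in VOWELS or w in MUTE_H_WORDS)


def elide(first: str, second: str) -> str:
    """Join two words, dropping the final vowel of `first` when the
    simplified elision rule applies; otherwise keep both words."""
    if first.lower() in ELIDING_WORDS and starts_with_vowel_sound(second):
        return first[:-1] + "'" + second
    return first + " " + second


print(elide("je", "ai"))     # j'ai
print(elide("que", "il"))    # qu'il
print(elide("le", "homme"))  # l'homme
print(elide("je", "vois"))   # je vois
```

Because qui is not in the list of eliding words, `elide("qui", "a")` leaves both words intact, which matches the contrast between qu'il a vu and qui l'a vu.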
Orthography
Nasal: n and m.
When n or m follows a vowel or diphthong, the n or m becomes silent and causes the preceding vowel to become nasalized (i.e. pronounced with the soft palate extended downward so as to allow part of the air to leave through the nostrils).
Exceptions are when the n or m is doubled, or immediately followed by a vowel.
The prefixes en- and em- are always nasalized.
The rules get more complex than this but may vary between dialects.
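As a rough illustration of the basic nasalization rule, the Python sketch below marks which n or m letters in a spelled word would be silent and nasalize the preceding vowel. The function name `nasalizing_positions` and the vowel set are assumptions made for the example. The sketch reads "doubled" as the same letter repeated, and it deliberately ignores the en-/em- prefixes, borrowed -ing endings, and dialect variation.

```python
VOWELS = set("aeiouyàâéèêëîïôûœ")  # assumed vowel letters for this sketch
NASAL_LETTERS = {"n", "m"}


def nasalizing_positions(word: str) -> list[int]:
    """Return the indices of n/m letters that follow a vowel and are
    neither doubled nor immediately followed by a vowel."""
    w = word.lower()
    positions = []
    for i, ch in enumerate(w):
        if ch not in NASAL_LETTERS or i == 0 or w[i - 1] not in VOWELS:
            continue
        nxt = w[i + 1] if i + 1 < len(w) else None
        if nxt is not None and (nxt in VOWELS or nxt == ch):
            continue  # exception: followed by a vowel, or doubled
        positions.append(i)
    return positions


print(nasalizing_positions("chien"))     # [4]
print(nasalizing_positions("chienne"))   # []
print(nasalizing_positions("montagne"))  # [2]
```

The chien and chienne pair shows the doubling exception at work: the single final n nasalizes the vowel, while the doubled n does not.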
Digraphs: French does not introduce extra letters or diacritics to specify its large range of vowel sounds and diphthongs; rather, it uses specific combinations of vowels, sometimes with following consonants, to show which sound is intended.
Gemination: Within words, double consonants are generally not pronounced as geminates in modern French (but geminates can be heard in the cinema or TV news from as recently as the 1970s, and in very refined elocution they may still occur).
For example, illusion is pronounced [ilyzjɔ̃] and not [illyzjɔ̃].
But gemination does occur between words.
For example, une info ("a piece of news") is pronounced [ynɛ̃fo], whereas une nympho ("a nympho") is pronounced [ynnɛ̃fo].
Accents are used sometimes for pronunciation, sometimes to distinguish similar words, and sometimes for etymology alone.
Accents that affect pronunciation
The acute accent (l'accent aigu), é (e.g. école, "school"), means that the vowel is pronounced /e/ instead of the default /ə/.
The grave accent (l'accent grave), è (e.g. élève, "pupil"), means that the vowel is pronounced /ɛ/ instead of the default /ə/.
The circumflex (l'accent circonflexe), ê (e.g. forêt, "forest"), shows that an e is pronounced /ɛ/ and that an o is pronounced /o/.
In standard French it also signifies a pronunciation of /ɑ/ for the letter a, but this differentiation is disappearing.
In the late 19th century, the circumflex was used in place of s where that letter was not to be pronounced.
Thus, forest became forêt and hospital became hôpital.
The diaeresis (le tréma) (e.g. naïf, "foolish"; Noël, "Christmas"), as in English, specifies that the vowel is pronounced separately from the preceding one rather than combined with it, and that it is not a schwa.
The cedilla (la cédille), ç (e.g. garçon, "boy"), means that the letter c is pronounced /s/ in front of the hard vowels a, o and u (c is otherwise /k/ before a hard vowel).
C is always pronounced /s/ in front of the soft vowels e, i, and y; thus ç is never found in front of soft vowels.
Accents with no pronunciation effect
The circumflex does not affect the pronunciation of the letters i or u, and in most dialects, a as well.
It usually indicates that an s came after it long ago, as in hôtel.
All other accents are used only to distinguish similar words, as in the case of distinguishing the adverbs là and où ("there", "where") from the article la and the conjunction ou ("the" fem. sing., "or") respectively.
Grammar
French grammar shares several notable features with most other Romance languages, including:
the loss of Latin's declensions
only two grammatical genders
the development of grammatical articles from Latin demonstratives
new tenses formed from auxiliaries
French word order is Subject Verb Object, except when the object is a pronoun, in which case the word order is Subject Object Verb.
Some rare archaisms allow for different word orders.
Vocabulary
The majority of French words derive from Vulgar Latin or were constructed from Latin or Greek roots.
There are often pairs of words, one form being "popular" (noun) and the other one "savant" (adjective), both originating from Latin.
Example:
brother: frère / fraternel < from Latin frater
finger: doigt / digital < from Latin digitus
faith: foi / fidèle < from Latin fides
cold: froid / frigide < from Latin frigidus
eye: œil / oculaire < from Latin oculus
In some examples there is a common word from Vulgar Latin and a more savant word borrowed directly from Medieval Latin or even Ancient Greek.
Cheval—Concours équestre—Hippodrome
The French words which have developed from Latin are usually less recognisable than Italian words of Latin origin because as French evolved from Vulgar Latin, the unstressed final syllable of many words was dropped or elided into the following word.
It is estimated that 12% (4,200) of common French words found in a typical dictionary such as the Petit Larousse or Micro-Robert Plus (35,000 words) are of foreign origin.
About 25% (1,054) of these foreign words come from English and are fairly recent borrowings.
The others are some 707 words from Italian, 550 from ancient Germanic languages, 481 from ancient Gallo-Romance languages, 215 from Arabic, 164 from German, 160 from Celtic languages, 159 from Spanish, 153 from Dutch, 112 from Persian and Sanskrit, 101 from Native American languages, 89 from other Asian languages, 56 from other Afro-Asiatic languages, 55 from Slavic and Baltic languages, 10 from Basque and 144 (about three percent) from other languages.
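The figures quoted above can be cross-checked with a little arithmetic. The Python sketch below simply tallies the numbers given in the text; the variable and function names are assumptions made for the example, and no data beyond the quoted counts is used.

```python
DICTIONARY_SIZE = 35_000     # words in a typical dictionary, as quoted
FOREIGN_SHARE_PERCENT = 12   # share of foreign-origin words, as quoted

# Counts of borrowed words by source language, as listed in the text.
BORROWINGS = {
    "English": 1054, "Italian": 707, "ancient Germanic": 550,
    "ancient Gallo-Romance": 481, "Arabic": 215, "German": 164,
    "Celtic": 160, "Spanish": 159, "Dutch": 153,
    "Persian and Sanskrit": 112, "Native American": 101,
    "other Asian": 89, "other Afro-Asiatic": 56,
    "Slavic and Baltic": 55, "Basque": 10, "other": 144,
}

# 12% of 35,000, computed with integers to avoid rounding noise.
estimated_foreign = DICTIONARY_SIZE * FOREIGN_SHARE_PERCENT // 100
listed_total = sum(BORROWINGS.values())


def share_of_foreign(source: str) -> float:
    """Percentage of the estimated foreign vocabulary from one source."""
    return 100 * BORROWINGS[source] / estimated_foreign


print(estimated_foreign)                   # 4200
print(listed_total)                        # 4210
print(round(share_of_foreign("English")))  # 25
print(round(share_of_foreign("other")))    # 3
```

The itemized counts add up to 4,210, slightly more than the rounded estimate of 4,200, which suggests that the 12% figure is itself a rounded value.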
Numerals
The French counting system is partially vigesimal: twenty (vingt) is used as a base number in the names of numbers from 60–99.
The French word for eighty, for example, is quatre-vingts, which literally means "four twenties", and soixante-quinze (literally "sixty-fifteen") means 75.
This reform arose after the French Revolution to unify the different counting systems (mostly vigesimal near the coast, owing to Celtic, via Basque, and Viking influence).
This system is comparable to the archaic English use of score, as in "fourscore and seven" (87), or "threescore and ten" (70).
Belgian French and Swiss French are different in this respect.
In Belgium and Switzerland 70 and 90 are septante and nonante.
In Switzerland, depending on the local dialect, 80 can be quatre-vingts (Geneva, Neuchâtel, Jura) or huitante (Vaud, Valais, Fribourg).
Octante had been used in Switzerland in the past, but is now considered archaic.
In Belgium, however, quatre-vingts is universally used.
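The difference between the partly vigesimal names used in France and the decimal forms septante, huitante and nonante can be made concrete with a short Python sketch covering 60 to 99. The function name and the variant labels `"france"`, `"belgium"` and `"swiss_huitante"` are assumptions made for the example. The spellings follow the traditional convention (soixante et un, without hyphens around et), and only this range of numbers is handled.

```python
SMALL = ["", "un", "deux", "trois", "quatre", "cinq", "six", "sept",
         "huit", "neuf", "dix", "onze", "douze", "treize", "quatorze",
         "quinze", "seize", "dix-sept", "dix-huit", "dix-neuf"]


def french_number_name(n: int, variant: str = "france") -> str:
    """Name of a number from 60 to 99 in one of three illustrative variants."""
    if not 60 <= n <= 99:
        raise ValueError("this sketch only covers 60-99")
    if variant not in ("france", "belgium", "swiss_huitante"):
        raise ValueError("unknown variant")
    decimal = variant != "france"  # septante / nonante in use

    if n < 70 or (n < 80 and not decimal):
        tens, rest = "soixante", n - 60       # 70-79 count on from sixty
    elif n < 80:
        tens, rest = "septante", n - 70
    elif n < 90 or not decimal:
        tens = "huitante" if variant == "swiss_huitante" else "quatre-vingt"
        rest = n - 80                         # 90-99 count on from eighty
    else:
        tens, rest = "nonante", n - 90

    if rest == 0:
        return "quatre-vingts" if tens == "quatre-vingt" else tens
    if tens == "quatre-vingt":
        return f"{tens}-{SMALL[rest]}"        # no "et": quatre-vingt-un
    if rest in (1, 11):
        return f"{tens} et {SMALL[rest]}"     # soixante et un, soixante et onze
    return f"{tens}-{SMALL[rest]}"


print(french_number_name(75))                    # soixante-quinze
print(french_number_name(80))                    # quatre-vingts
print(french_number_name(99))                    # quatre-vingt-dix-neuf
print(french_number_name(70, "belgium"))         # septante
print(french_number_name(80, "belgium"))         # quatre-vingts
print(french_number_name(80, "swiss_huitante"))  # huitante
```

In the `"france"` variant the remainder runs from 0 to 19 on top of soixante or quatre-vingt, which is exactly what makes the system vigesimal in this range.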
Writing system
French is written using the 26 letters of the Latin alphabet, plus five diacritics (the circumflex accent, acute accent, grave accent, diaeresis, and cedilla) and the two ligatures (œ) and (æ).
French spelling, like English spelling, tends to preserve obsolete pronunciation rules.
This is mainly due to extreme phonetic changes since the Old French period, without a corresponding change in spelling.
Moreover, some conscious changes were made to restore Latin orthography:
Old French doit > French doigt "finger" (Latin digitus)
Old French pie > French pied "foot" (Latin pes, stem: ped-)
As a result, it is difficult to predict the spelling on the basis of the sound alone.
Final consonants are generally silent, except when the following word begins with a vowel.
For example, all of these words end in a vowel sound: pied, aller, les, finit, beaux.
The same words followed by a vowel, however, may sound the consonants, as they do in these examples: beaux-arts, les amis, pied-à-terre.
On the other hand, a given spelling will almost always lead to a predictable sound, and the Académie française works hard to enforce and update this correspondence.
In particular, a given vowel combination or diacritic predictably leads to one phoneme.
The diacritics have phonetic, semantic, and etymological significance.
acute accent (é): Over an e, indicates the sound of a short ai in English, with no diphthong.
An é in modern French is often used where a combination of e and a consonant, usually s, would have been used formerly: écouter < escouter.
This type of accent mark is called accent aigu in French.
grave accent (à, è, ù): Over a or u, used only to distinguish homophones: à ("to") vs. a ("has"), ou ("or") vs. où ("where").
Over an e, indicates the sound /ɛ/.
circumflex (â, ê, î, ô, û): Over an a, e or o, indicates the sound /ɑ/, /ɛ/ or /o/, respectively (the distinction a vs. â tends to disappear in many dialects).
Most often indicates the historical deletion of an adjacent letter (usually an s or a vowel): château < castel, fête < feste, sûr < seur, dîner < disner.
It has also come to be used to distinguish homophones: du ("of the") vs. dû (past participle of devoir "to have to do something (pertaining to an act)"; note that dû is in fact written thus because of a dropped e: deu).
(See Use of the circumflex in French)
diaeresis or tréma (ë, ï, ü, ÿ): Indicates that a vowel is to be pronounced separately from the preceding one: naïve, Noël.
A diaeresis on y only occurs in some proper names and in modern editions of old French texts.
Some proper names in which ÿ appears include Aÿ (commune in canton de la Marne formerly Aÿ-Champagne), Rue des Cloÿs (alley in the 18th arrondissement of Paris), Croÿ (family name and hotel on the Boulevard Raspail, Paris), Château du Feÿ (near Joigny), Ghÿs (name of Flemish origin spelt Ghijs where ij in handwriting looked like ÿ to French clerks), l'Haÿ-les-Roses (commune between Paris and Orly airport), Pierre Louÿs (author), Moÿ (place in commune de l'Aisne and family name), and Le Blanc de Nicolaÿ (an insurance company in eastern France).
The diaeresis on u appears only in the biblical proper names Archélaüs, Capharnaüm, Emmaüs, Ésaü and Saül.
Nevertheless, since the 1990 orthographic rectifications (which are not applied at all by most French people), the diaeresis in words containing guë (such as aiguë or ciguë) may be moved onto the u: aigüe, cigüe.
Words coming from German retain the old Umlaut (ä, ö and ü) if applicable but use French pronunciation, such as kärcher (trade mark of a pressure washer).
cedilla (ç): Indicates that an etymological c is pronounced /s/ when it would otherwise be pronounced /k/.
Thus je lance "I throw" (with c = /s/ before e), je lançais "I was throwing" (c would be pronounced /k/ before a without the cedilla).
The c cedilla (ç) softens the hard /k/ sound to /s/ before the vowels a, o or u, for example ça /sa/.
C cedilla is never used before the vowels e or i since these two vowels always produce a soft /s/ sound (ce, ci).
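The rule for c and ç can be sketched as a small Python function. This is an illustrative simplification, not a full grapheme-to-phoneme model: the function name `c_sounds` is invented for the example, treating y and accented forms of e and i as soft vowels is an assumption, and digraphs such as ch, doubled cc, and silent final c are ignored.

```python
# Assumed set of "soft" vowel letters for this sketch.
SOFT_VOWELS = set("eiyéèêëîï")


def c_sounds(word: str) -> list[str]:
    """Return "s" or "k" for each c or ç in the word, in order,
    using only the following letter to decide."""
    w = word.lower()
    sounds = []
    for i, ch in enumerate(w):
        if ch == "ç":
            sounds.append("s")  # the cedilla always marks /s/
        elif ch == "c":
            nxt = w[i + 1] if i + 1 < len(w) else ""
            sounds.append("s" if nxt in SOFT_VOWELS else "k")
    return sounds


print(c_sounds("lance"))    # ['s']
print(c_sounds("lançais"))  # ['s']
print(c_sounds("ça"))       # ['s']
print(c_sounds("école"))    # ['k']
```

Removing the cedilla from lançais would give c before a, which the function reports as /k/; that is the mispronunciation the cedilla exists to prevent.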
There are two ligatures, which have various origins.
The ligature œ is a mandatory contraction of oe in certain words.
Some of these are native French words, with the pronunciation /œ/ or /ø/, e.g. sœur "sister" /sœʁ/, œuvre "work (of art)" /œvʁ/.
Note that it usually appears in the combination œu; œil is an exception.
Many of these words were originally written with the digraph eu; the o in the ligature represents a sometimes artificial attempt to imitate the Latin spelling: Latin bovem > Old French buef/beuf > Modern French bœuf. Œ is also used in words of Greek origin, as the Latin rendering of the Greek diphthong οι, e.g. cœlacanthe "coelacanth".
These words used to be pronounced with the vowel /e/, but in recent years a spelling pronunciation with /ø/ has taken hold, e.g. œsophage /ezɔfaʒ/ or /øzɔfaʒ/.
The pronunciation with /e/ is often seen to be more correct.
The ligature œ is not used in some occurrences of the letter combination oe, for example, when o is part of a prefix (coexister).
The ligature æ is rare and appears in some words of Latin and Greek origin like ægosome, ægyrine, æschne, cæcum, nævus or uræus.
The vowel quality is identical to é /e/.
French writing, as with any language, is affected by the spoken language.
In Old French, the plural for animal was animals.
Common speakers pronounced a u before a word ending in l as the plural.
This resulted in animauls.
As the French language evolved this vanished and the form animaux (aux pronounced /o/) was admitted.
The same is true for cheval pluralized as chevaux and many others.
Also castel pl. castels became château pl. châteaux.
Samples
German language
The German language (Deutsch) is a West Germanic language and one of the world's major languages.
German is closely related to and classified alongside English and Dutch.
Around the world, German is spoken by approximately 100 million native speakers and also about 80 million non-native speakers, and Standard German is widely taught in schools, universities, and Goethe Institutes worldwide.
Geographic distribution
Europe
German is spoken primarily in Germany (95%), Austria (89%) and Switzerland (64%), which, together with Liechtenstein and Luxembourg, constitute the countries where German is the majority language (D-A-CH-Li-Lux).
Other European German-speaking communities are found in Italy (Bolzano-Bozen), in the East Cantons of Belgium, in the French region of Alsace, which changed hands between Germany and France several times over its history, and in some border villages of the former South Jutland County (in German, Nordschleswig; in Danish, Sønderjylland) of Denmark.
Some German-speaking communities still survive in parts of Romania, the Czech Republic, Poland, Hungary, and above all Russia and Kazakhstan, although forced expulsions after World War II and massive emigration to Germany in the 1980s and 1990s have depopulated most of these communities.
It is also spoken by German-speaking foreign populations and some of their descendants in Portugal, Spain, Italy, Morocco, Egypt, Israel, Cyprus, Turkey, Greece, United Kingdom, Netherlands, Scandinavia, Siberia in Russia, Hungary, Romania, Bulgaria, and the former Yugoslavia (Bosnia, Serbia, Macedonia, Croatia and Slovenia).
In Luxembourg and the surrounding areas, large parts of the native population speak German dialects, and some people have also mastered standard German (especially in Luxembourg), although in the French regions of Alsace (German: Elsass) and Lorraine (German: Lothringen) French has replaced the local German dialects as the official language, even though it has not fully replaced them on the street.
Overseas
Outside of Europe and the former Soviet Union, the largest German-speaking communities are to be found in the United States, Canada, Brazil and Argentina, to which millions of Germans migrated in the last 200 years; but the vast majority of their descendants no longer speak German.
Additionally, German-speaking communities can be found in the former German colony of Namibia, independent from South Africa since 1990, as well as in other countries of German emigration such as Canada, Mexico, Dominican Republic, Paraguay, Uruguay, Chile, Peru, Venezuela (where Alemán Coloniero developed), South Africa and Australia.
South America
In Brazil the largest concentrations of German speakers are in Rio Grande do Sul (where Riograndenser Hunsrückisch developed), Santa Catarina, Paraná, and Espírito Santo; there are also large communities of German-speaking descendants in Argentina, Uruguay and Chile.
In the 20th century, over 100,000 German political refugees and invited entrepreneurs settled in Latin American countries such as Costa Rica, Panama, Venezuela and the Dominican Republic, establishing German-speaking enclaves, and there is reportedly a small German immigration to Puerto Rico.
North America
The United States has the largest concentration of German speakers outside of Europe; an indication of this presence can be found in the names of such villages and towns as New Leipzig, Munich, Karlsruhe, and Strasburg, North Dakota, and New Braunfels, Texas.
Though over the course of the 20th century many of the descendants of 18th and 19th-century immigrants ceased speaking German at home, small populations of elderly (as well as some younger) speakers can be found in Pennsylvania (Amish, Hutterites, Dunkards and some Mennonites historically spoke Pennsylvania Dutch (a West Central German variety) and Hutterite German), Kansas (Mennonites and Volga Germans), North Dakota (Hutterite Germans, Mennonites, Russian Germans, Volga Germans, and Baltic Germans), South Dakota, Montana, Texas (Texas German), Wisconsin, Indiana, Louisiana and Oklahoma.
Early twentieth century immigration was often to St. Louis, Chicago, New York, Pittsburgh and Cincinnati.
Most of the post–World War II wave are in the New York, Philadelphia, Los Angeles, San Francisco and Chicago urban areas, and in Florida, Arizona and California where large communities of retired German, Swiss and Austrian expatriates live.
The American population of German ancestry is above 60 million.
The German language is the third largest language in the U.S. after Spanish.
In Canada there are people of German ancestry throughout the country and especially in the western cities such as Kelowna.
German is also spoken in Ontario and southern Nova Scotia.
There is a large and vibrant community in the city of Kitchener, Ontario.
German immigrants were instrumental in the country's three largest urban areas: Montreal, Toronto and Vancouver, but post-WWII immigrants managed to preserve a fluency in the German language in their respective neighborhoods and sections.
In the first half of the 20th century, over a million German-Canadians made the language one of Canada's most spoken after French.
In Mexico there are also large populations of German ancestry, mainly in the cities of Mexico City, Puebla, Mazatlán and Tapachula, with larger populations scattered in the states of Chihuahua, Durango, and Zacatecas.
German ancestry is also said to be found in neighboring towns around Guadalajara, Jalisco and much of Northern Mexico, where German influence was absorbed into Mexican culture.
Standard German is spoken by the affluent German communities in Puebla, Mexico City, Nuevo Leon, San Luis Potosi and Quintana Roo.
German immigration in the twentieth century was small, but produced German-speaking communities in Central America (i.e. Guatemala, Honduras and Nicaragua) and the Caribbean Islands like the Dominican Republic.
Dialects in North America:
The dialects of German which are or were primarily spoken in colonies or communities founded by German speaking people resemble the dialects of the regions the founders came from.
For example, Pennsylvania German resembles dialects of the Palatinate, and Hutterite German resembles dialects of Carinthia.
Texas German is a dialect spoken in the areas of Texas settled by the Adelsverein, such as New Braunfels and Fredericksburg.
In the Amana Colonies in the state of Iowa Amana German is spoken.
Plautdietsch is a large minority language spoken in Northern Mexico by the Mennonite communities, and is spoken by more than 200,000 people in Mexico.
Hutterite German is an Upper German dialect of the Austro-Bavarian variety of the German language, which is spoken by Hutterite communities in Canada and the United States.
Hutterite is spoken in the U.S. states of Washington, Montana, North Dakota and South Dakota, and Minnesota; and in the Canadian provinces of Alberta, Saskatchewan and Manitoba.
Its speakers belong to some Schmiedleit, Lehrerleit, and Dariusleit Hutterite groups, but there are also speakers among the older generations of Prairieleit (the descendants of those Hutterites who chose not to settle in colonies).
Hutterite children who grow up in the colonies learn and speak Hutterite German first, before learning English, the standard language of the surrounding areas, in the public school.
Many colonies though continue with German Grammar School, separate from the public school, throughout a student's elementary education.
Creoles
There is an important German creole being studied and recovered, named Unserdeutsch, spoken by a few elderly people in the former German colony of Papua New Guinea, across Micronesia and in northern Australia (i.e. coastal parts of Queensland and Western Australia).
The risk of its extinction is serious and efforts to revive interest in the language are being implemented by scholars.
Internet
According to Global Reach (2004), 6.9% of the Internet population is German.
According to Netz-tipp (2002), 7.7% of webpages are written in German, making it second only to English in the European language group.
They also report that 12% of Google's users use its German interface.
Older statistics: Babel (1998) found somewhat similar demographics.
FUNREDES (1998) and Vilaweb (2000) both found that German is the third most popular language used by websites, after English and Japanese.
History
The history of the language begins with the High German consonant shift during the migration period, separating High German dialects from common West Germanic.
The earliest testimonies of Old High German are from scattered Elder Futhark inscriptions, especially in Alemannic, from the 6th century; the earliest glosses (Abrogans) date to the 8th century and the oldest coherent texts (the Hildebrandslied, the Muspilli and the Merseburg Incantations) to the 9th century.
Old Saxon at this time belongs to the North Sea Germanic cultural sphere, and Low Saxon should fall under German rather than Anglo-Frisian influence during the Holy Roman Empire.
As Germany was divided into many different states, the only force working for a unification or standardization of German during a period of several hundred years was the general preference of writers trying to write in a way that could be understood in the largest possible area.
When Martin Luther translated the Bible (the New Testament in 1522 and the Old Testament, published in parts and completed in 1534) he based his translation mainly on the bureaucratic standard language used in Saxony (sächsische Kanzleisprache) also known as Meißner-Deutsch (Meißner-German), which was the most widely understood language at this time, because the region it was spoken in was quite influential amongst the German states.
This language was based on Eastern Upper and Eastern Central German dialects and preserved much of the grammatical system of Middle High German (unlike the spoken German dialects in Central and Upper Germany that already at that time began to lose the genitive case and the preterite tense).
In the beginning, copies of the Bible had a long list for each region, which translated words unknown in the region into the regional dialect.
Roman Catholics rejected Luther's translation in the beginning and tried to create their own Catholic standard (gemeines Deutsch) — which, however, only differed from 'Protestant German' in some minor details.
It took until the middle of the 18th century to create a standard that was widely accepted, thus ending the period of Early New High German.
In 1901 the 2nd Orthographical Conference ended with a complete standardization of the German language in written form, while the Deutsche Bühnensprache (literally: German stage-language) had already established, three years earlier, pronunciation rules that were later to become obligatory for general German pronunciation.
German used to be the language of commerce and government in the Habsburg Empire, which encompassed a large area of Central and Eastern Europe.
Until the mid-19th century it was essentially the language of townspeople throughout most of the Empire.
Its use indicated that the speaker was a merchant or an urbanite, not the speaker's nationality.
Some cities, such as Prague (German: Prag) and Budapest (Buda, German: Ofen), were gradually Germanized in the years after their incorporation into the Habsburg domain.
Others, such as Bratislava (German: Pressburg), were originally settled during the Habsburg period and were primarily German at that time.
A few cities such as Milan (German: Mailand) remained primarily non-German.
However, most cities were primarily German during this time, such as Prague, Budapest, Bratislava (German: Pressburg), Zagreb (German: Agram), and Ljubljana (German: Laibach), though they were surrounded by territory that spoke other languages.
Until about 1800, standard German was almost only a written language.
At this time, people in urban northern Germany, who spoke dialects very different from Standard German, learned it almost like a foreign language and tried to pronounce it as close to the spelling as possible.
Prescriptive pronunciation guides used to consider northern German pronunciation to be the standard.
However, the actual pronunciation of standard German varies from region to region.
Media and written works are almost all produced in standard German (often called Hochdeutsch in German) which is understood in all areas where German is spoken, except by pre-school children in areas which speak only dialect, for example Switzerland and Austria.
However, in this age of television, even they now usually learn to understand Standard German before school age.
The first dictionary of the Brothers Grimm, the 16 parts of which were issued between 1852 and 1860, remains the most comprehensive guide to the words of the German language.
In 1860, grammatical and orthographic rules first appeared in the Duden Handbook.
In 1901, this was declared the standard definition of the German language.
Official revisions of some of these rules were not issued until 1998, when the German spelling reform of 1996 was officially promulgated by governmental representatives of all German-speaking countries.
Since the reform, German spelling has been in an eight-year transitional period where the reformed spelling is taught in most schools, while traditional and reformed spellings co-exist in the media.
See German spelling reform of 1996 for an overview of the public debate concerning the reform with some major newspapers and magazines and several known writers refusing to adopt it.
The German spelling reform of 1996 led to public controversy, indeed to considerable dispute.
The parliaments of some states (Bundesländer), namely North Rhine-Westphalia and Bavaria, would not accept it.
The dispute at one point reached the highest court, which dealt with it briefly, ruling that the states had to decide for themselves and that only in schools could the reform be made the official rule; everybody else could continue writing as they had learned.
After 10 years, without any intervention by the federal parliament, a major yet incomplete revision was introduced in 2006, just in time for the new school year.
In 2007, some venerable spellings will be finally invalidated even though they caused little or no trouble.
The only sure and easily recognizable sign that a text complies with the reform is the -ss at the end of words, as in dass and muss.
Classic spelling forbade this ending, instead using daß and muß.
The controversy revolved around the question of whether a language is part of the culture, which must be preserved, or a means of communicating information, which has to allow for growth.
(The reformers seemed to be unimpressed by the fact that a considerable part of that culture - namely the entire German literature of the 20th century - is in the old spelling.)
The increasing use of English in Germany's higher education system, as well as in business and in popular culture, has led various German academics to state, not necessarily from an entirely negative perspective, that German is a language in decline in its native country.
For example, Ursula Kimpel, of the University of Tübingen, said in 2005 that “German universities are offering more courses in English because of the large number of students coming from abroad.
German is unfortunately a language in decline.
We need and want our professors to be able to teach effectively in English.”
Standard German
Standard German originated not as a traditional dialect of a specific region, but as a written language.
However, there are places where the traditional regional dialects have been replaced by standard German; this is the case in vast stretches of Northern Germany, but also in major cities in other parts of the country.
Standard German differs regionally, between German-speaking countries, in vocabulary and some instances of pronunciation, and even grammar and orthography.
This variation must not be confused with the variation of local dialects.
Even though the regional varieties of standard German are only to a certain degree influenced by the local dialects, they are very distinct.
German is thus considered a pluricentric language.
In most regions, the speakers use a continuum of mixtures from more dialectal varieties to more standard varieties according to situation.
In the German-speaking parts of Switzerland, mixtures of dialect and standard are very seldom used, and the use of standard German is largely restricted to the written language.
Therefore, this situation has been called a medial diglossia.
Swiss Standard German is used in the Swiss education system.
Official status
Standard German is the only official language in Liechtenstein and Austria; it shares official status in Germany (with Danish, Frisian and Sorbian as minority languages), Switzerland (with French, Italian and Romansh), Belgium (with Dutch and French) and Luxembourg (with French and Luxembourgish).
It is used as a local official language in Italy (Province of Bolzano-Bozen), as well as in the cities of Sopron (Hungary), Krahule (Slovakia) and several cities in Romania.
It is the official language (with Italian) of the Vatican Swiss Guard.
German has an officially recognized status as regional or auxiliary language in Denmark (South Jutland region), France (Alsace and Moselle regions), Italy (Gressoney valley), Namibia, Poland (Opole region), and Russia (Asowo and Halbstadt).
German is one of the 23 official languages of the European Union.
It is the language with the largest number of native speakers in the European Union and, narrowly behind English and well ahead of French, the second-most spoken language in Europe.
German as a foreign language
German is the third most taught foreign language in the English speaking world after French and Spanish.
German is the main language of about 90–95 million people in Europe (as of 2004), or 13.3% of all Europeans, being the second most spoken native language in Europe after Russian, above French (66.5 million speakers in 2004) and English (64.2 million speakers in 2004).
It is therefore the most spoken first language in the EU.
It is the second most known foreign language in the EU.
It is one of the official languages of the European Union, and one of the three working languages of the European Commission, along with English and French.
Thirty-two percent of citizens of the EU-15 countries say they can converse in German (either as a mother tongue or as a second or foreign language).
This is assisted by the widespread availability of German TV by cable or satellite.
German was once, and still remains to some extent, a lingua franca in Central, Eastern and Northern Europe.
Dialects
German is a member of the western branch of the Germanic family of languages, which in turn is part of the Indo-European language family.
The German dialect continuum is traditionally divided most broadly into High German and Low German.
The variation among the German dialects is considerable, with only the neighbouring dialects being mutually intelligible.
Some dialects are not intelligible to people who only know standard German.
However, all German dialects belong to the dialect continuum of High German and Low Saxon languages.
Until roughly the end of the Second World War, there was a dialect continuum of all the continental West Germanic languages because nearly any pair of neighbouring dialects were perfectly mutually intelligible.
Low German
Low Saxon varieties (spoken on German territory) are considered linguistically a language separate from the German language by some, but just a dialect by others.
Sometimes, Low Saxon and Low Franconian are grouped together because both are unaffected by the High German consonant shift.
However, the proportion of the population capable of speaking it, or of understanding it, has decreased continuously since World War II.
Currently the effort to maintain a residual presence in cultural life is negligible.
Middle Low German was the lingua franca of the Hanseatic League.
It was the predominant language in Northern Germany.
This changed in the 16th century.
In 1534 Martin Luther's translation of the Bible was printed.
This translation is considered to be an important step in the evolution of Early New High German.
It aimed to be understandable to a broad audience and was based mainly on Central and Upper German varieties.
The Early New High German language gained more prestige than Low Saxon and became the language of science and literature.
Other factors were that around the same time, the Hanseatic league lost its importance as new trade routes to Asia and the Americas were established, and that the most powerful German states of that period were located in Middle and Southern Germany.
The 18th and 19th centuries were marked by mass education, the language of the schools being standard German.
Low Saxon was gradually pushed back until it was nothing but a language spoken by the uneducated and at home.
Today Low Saxon can be divided into two groups: Low Saxon varieties with a noticeable Standard German influence, and varieties of Standard German with a Low Saxon influence, known as Missingsch.
High German
High German is divided into Central German and Upper German.
Central German dialects include Ripuarian, Moselle Franconian, Hessian, Thuringian, South Franconian, Lorraine Franconian and Upper Saxon.
It is spoken in the southeastern Netherlands, eastern Belgium, Luxembourg, parts of France, and in Germany approximately between the River Main and the southern edge of the Lowlands.
Modern Standard German is mostly based on Central German, but it should be noted that the common (but not linguistically correct) German term for modern Standard German is Hochdeutsch, that is, High German.
The Moselle Franconian varieties spoken in Luxembourg have been officially standardised and institutionalised and are therefore usually considered a separate language known as Luxembourgish.
Upper German dialects include Alemannic (for instance Swiss German), Swabian, East Franconian, Alsatian and Austro-Bavarian.
They are spoken in parts of the Alsace, southern Germany, Liechtenstein, Austria, and in the German-speaking parts of Switzerland and Italy.
Wymysorys is a High German dialect of Poland; Sathmarisch and Siebenbürgisch are High German dialects of Romania.
The High German varieties spoken by Ashkenazi Jews (mostly in the former Soviet Union) have several unique features, and are usually considered as a separate language, Yiddish.
It is the only Germanic language that does not use the Latin alphabet as its standard script.
German dialects versus varieties of standard German
In German linguistics, German dialects are distinguished from varieties of standard German.
The German dialects are the traditional local varieties.
They are traditionally traced back to the different German tribes.
Many of them are hardly understandable to someone who knows only standard German, since they often differ from standard German in lexicon, phonology and syntax.
If a narrow definition of language based on mutual intelligibility is used, many German dialects are considered to be separate languages (for instance in the Ethnologue).
However, such a point of view is unusual in German linguistics.
The varieties of standard German refer to the different local varieties of the pluricentric standard German.
They only differ slightly in lexicon and phonology.
In certain regions, they have replaced the traditional German dialects, especially in Northern Germany.
Grammar
German is an inflected language.
Noun inflection
German nouns inflect into:
one of four cases: nominative, genitive, dative, and accusative.
one of three genders: masculine, feminine, or neuter.
Word endings sometimes reveal grammatical gender; for instance, nouns ending in ...ung (-ing), ...e, ...schaft (-ship), ...keit or ...heit (-hood) are feminine, while nouns ending in ...chen or ...lein (diminutive forms) are neuter and nouns ending in ...ismus (-ism) are masculine.
The gender of other nouns is disputed and sometimes depends on the region in which the language is spoken.
Additionally, ambiguous endings exist, such as ...er (-er), e.g. Feier (feminine), engl. celebration, party, and Arbeiter (masculine), engl. labourer.
Sentences can usually be reorganized to avoid a misunderstanding.
two numbers: singular and plural
Although German is usually cited as an outstanding example of a highly inflected language, the degree of inflection is considerably less than in Old German, or in other old Indo-European languages such as Latin, Ancient Greek, or Sanskrit.
The three genders have collapsed in the plural, which now behaves, grammatically, somewhat as a fourth gender.
With four cases and three genders plus plural there are 16 distinct possible combinations of case and gender/number, but presently there are only six forms of the definite article used for the 16 possibilities.
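The count of 16 slots and six article forms can be checked with a short sketch. The table below is the usual textbook paradigm of the definite article; the names `ARTICLES`, `count_slots` and `distinct_forms` are illustrative only.

```python
# Definite article of standard German, indexed by case and by gender/number.
# The plural is treated as a fourth column, as described in the text.
ARTICLES = {
    "nominative": {"masculine": "der", "feminine": "die", "neuter": "das", "plural": "die"},
    "accusative": {"masculine": "den", "feminine": "die", "neuter": "das", "plural": "die"},
    "dative":     {"masculine": "dem", "feminine": "der", "neuter": "dem", "plural": "den"},
    "genitive":   {"masculine": "des", "feminine": "der", "neuter": "des", "plural": "der"},
}


def count_slots(table):
    """Number of case x gender/number combinations in the paradigm."""
    return sum(len(row) for row in table.values())


def distinct_forms(table):
    """Sorted list of the different article forms that fill those slots."""
    return sorted({form for row in table.values() for form in row.values()})


print(count_slots(ARTICLES))     # 16
print(distinct_forms(ARTICLES))  # ['das', 'dem', 'den', 'der', 'des', 'die']
```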
Inflection for case on the noun itself is required in the singular for strong masculine and neuter nouns in the genitive and sometimes in the dative.
Both of these cases are losing ground to substitutes in informal speech.
The dative ending is considered somewhat old-fashioned in many contexts and often dropped, but it is still used in sayings and in formal speech or in written language.
Weak masculine nouns share a common case ending for genitive, dative and accusative in the singular.
Feminines are not declined in the singular.
The plural does have an inflection for the dative.
In total, seven inflectional endings (not counting plural markers) exist in German: -s, -es, -n, -ns, -en, -ens, -e.
In the German orthography, nouns and most words with the syntactical function of nouns are capitalised, which is supposed to make it easier for readers to find out what function a word has within the sentence (Am Freitag bin ich einkaufen gegangen. — "On Friday I went shopping."; Eines Tages war er endlich da. — "One day he finally showed up".)
This spelling convention is almost unique to German today (shared perhaps only by the closely related Luxembourgish language), although it was historically common in other languages (e.g., Danish and English), too.
Like most Germanic languages, German forms left-branching noun compounds, where the first noun modifies the category given by the second, for example: Hundehütte (eng. dog hut; specifically: doghouse).
Unlike English, where newer compounds or combinations of longer nouns are often written in open form with separating spaces, German (like the other Germanic languages) nearly always uses the closed form without spaces, for example: Baumhaus (eng. tree house).
Like English, German allows arbitrarily long compounds, but these are rare.
(See also English compounds.)
The longest German word verified to be actually in (albeit very limited) use is Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz. [which, literally translated, breaks up into: Rind (cattle) - Fleisch (meat) - Etikettierung(s) (labelling) - Überwachung(s) (supervision) - Aufgaben (duties) - Übertragung(s) (assignment) - Gesetz (law), so "Beef labelling supervision duty assignment law".]
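The breakdown above can be reproduced mechanically. The following is a minimal sketch that joins the listed components and their linking -s into one closed compound; the names `PARTS` and `build_compound` are hypothetical, and the approach of listing the linking element by hand is an assumption, since German has no simple rule predicting where a linking -s appears.

```python
# (component, linking element) pairs, copied from the breakdown in the text.
PARTS = [
    ("Rind", ""),
    ("Fleisch", ""),
    ("Etikettierung", "s"),
    ("Überwachung", "s"),
    ("Aufgaben", ""),
    ("Übertragung", "s"),
    ("Gesetz", ""),
]


def build_compound(parts):
    """Join components into a closed compound written without spaces."""
    joined = "".join(stem + link for stem, link in parts)
    # Only the first letter of the resulting noun is capitalised.
    return joined[0].upper() + joined[1:].lower()


word = build_compound(PARTS)
print(word)
print(len(word))  # 63
```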
Verb inflection
Standard German verbs inflect into:
one of two conjugation classes, weak and strong (like English).
(There is actually a third class, known as mixed verbs, which exhibit inflections combining features of both the strong and weak patterns.)
three persons: 1st, 2nd, 3rd.
two numbers: singular and plural
three moods: Indicative, Subjunctive, Imperative
two voices (genera verbi): active and passive; the passive is a compound form and can be divided into static and dynamic passives.
two non-composed tenses (present, preterite) and four composed tenses (perfect, pluperfect, future and future perfect)
distinction between grammatical aspects is rendered by the combined use of subjunctive and/or preterite marking: a form with neither marking is the plain indicative; subjunctive marking alone conveys second-hand information; subjunctive plus preterite marking forms the conditional; and preterite marking alone is either the plain indicative (in the past) or functions as a (literal) alternative for second-hand information or for the conditional when one of them would otherwise be indistinguishable.
distinction between perfect and progressive aspect has been a productive category at every stage of the older language and exists in nearly all documented dialects but, strangely enough, is nowadays rigorously excluded from written usage in its present standardised form.
disambiguation of completed vs. uncompleted forms is widely observed and regularly generated by common prefixes (blicken - to look, erblicken - to see [unrelated form: sehen - to see]).
Verb prefixes
There are also many ways to expand, and sometimes radically change, the meaning of a base verb through a relatively small number of prefixes.
Some of those prefixes have a meaning themselves (Example: zer- refers to the destruction of things, as in zerreißen = to tear apart, zerbrechen = to break apart, zerschneiden = to cut apart), others do not have more than the vaguest meaning in and of themselves (Example: ver- , as in versuchen = to try, vernehmen = to interrogate, verteilen = to distribute, verstehen = to understand).
More examples: haften = to stick, verhaften = to imprison; kaufen = to buy, verkaufen = to sell; hören = to hear, aufhören = to cease; fahren = to drive, erfahren = to get to know, to hear about something.
Separable prefixes
Many German verbs have a separable prefix, often with an adverbial function.
In finite verb forms this is split off and moved to the end of the clause, and is hence considered by some to be a "resultative particle".
For example, mitgehen, meaning "to go with", would be split, giving Gehen Sie mit?
(Literal: "Go you with?"; Formal: "Are you going along?").
Indeed, several parenthetical clauses may occur between the prefix of a finite verb and its complement; e.g.
Er kam am Freitagabend nach einem harten Arbeitstag und dem üblichen Ärger, der ihn schon seit Jahren immer wieder an seinem Arbeitsplatz plagt, mit fraglicher Freude auf ein Mahl, das seine Frau ihm, wie er hoffte, bereits aufgetischt hatte, endlich zu Hause an.
A literal translation of this example might look like this:
He arr- on Friday evening, after a hard day at work and the usual trouble that has been plaguing him at his workplace again and again for years, with questionable joy at the prospect of a meal which, as he hoped, his wife had already served him, finally at home -ived.
Word order
German requires that a verbal element (main verb or auxiliary verb) appear second in the sentence, preceded by the most important topical phrase.
The second most important phrase appears at the end of the sentence.
For a sentence without an auxiliary, this gives several options:
(The old man gives me the book today)
(stress on mir)
(as well)
The position of a noun as a subject or object in a German sentence doesn't affect the meaning of the sentence as it would in English.
In a declarative sentence in English if the subject does not occur before the predicate the sentence could well be misunderstood.
For example, in the sentence "Man bites dog" it is clear who did what to whom.
To exchange the place of the subject with that of the object — "Dog bites man" — changes the meaning completely.
In other words the word order in a sentence conveys significant information.
In German, nouns and articles are declined as in Latin, thus indicating whether a noun is the subject or the object of the verb's action.
The above example in German would be Der Mann beißt den Hund or Den Hund beißt der Mann, with both having exactly the same meaning.
If the articles are omitted, which is sometimes done in headlines (Mann beißt Hund), the syntax applies as in English: the first noun is the subject and the noun following the predicate is the object.
Except for emphasis, adverbs of time have to appear in the third place in the sentence, just after the predicate.
Otherwise the speaker would be recognised as non-German.
For instance, the German word order (in Modern English) is: We're going tomorrow to town. (Wir gehen morgen in die Stadt.)
Auxiliary verbs
When an auxiliary verb is present, the auxiliary appears in second position, and the main verb appears at the end.
This occurs notably in the creation of the perfect tense.
Many word orders are still possible, e.g.:
(The old man gave me the book yesterday.)
The word order is generally less rigid than in Modern English except for nouns (see below).
There are two common word orders; one is for main clauses and another for subordinate clauses.
In normal positive sentences the inflected verb always has position 2; in questions, exclamations and wishes it always has position 1.
In subordinate clauses the verb is supposed to occur at the very end, but in speech this rule is often disregarded.
For example in a subordinate clause introduced by "weil" ("because") the verb quite often occupies the same order as in a main clause.
The correct way of saying "because I'm broke" is "weil ich pleite bin".
In the vernacular you may hear instead "weil ich bin pleite".
This phenomenon may be caused by mixing the word-order pattern used for the word weil with the pattern used for an alternative word for "because", denn, which is used with the main clause order ("denn ich bin pleite").
Modal verbs
Sentences using modal verbs place the infinitive at the end.
For example, the sentence in Modern English "Should he go home?" would be rearranged in German to say "Should he (to) home go?" (Soll er nach Hause gehen?).
Thus in sentences with several subordinate or relative clauses the infinitives are clustered at the end.
Compare the similar clustering of prepositions in the following English sentence: "What did you bring that book that I don't like to be read to out of up for?"
Multiple infinitives
The number of infinitives at the end is usually restricted to two, causing the third infinitive or auxiliary verb that would have gone at the very end to be placed instead at the beginning of the chain of verbs.
For example, the sentence "Should he move into the house that he just has had renovated?" would be rearranged to "Should he into the house move, that he just renovated had?".
().
The older form would have been ().
If there are more than three infinitives, all except the first two are relocated to the beginning of the chain.
Needless to say the rule is not rigorously applied.
Vocabulary
Most German vocabulary is derived from the Germanic branch of the Indo-European language family, although there are significant minorities of words derived from Latin and Greek, and a smaller number from French and, most recently, English.
At the same time, the effectiveness of the German language in forming equivalents for foreign words from its inherited Germanic stem repertory is great.
Thus, Notker Labeo was able to translate Aristotelian treatises in pure (Old High) German in the decades after the year 1000.
Overall, German has fewer Romance-language loanwords than does English.
The coining of new, autochthonous words gave German a vocabulary of an estimated 40,000 words as early as the ninth century.
In comparison, Latin, with a written tradition of nearly 2,500 years in an empire which ruled the Mediterranean, has grown to no more than 45,000 words today.
Even today, many low-key scholarly movements try to promote the Ersatz (substitution) of virtually all foreign words with ancient, dialectal, or neologous German alternatives.
It is claimed that this would also help in spreading modern or scientific notions among the less educated, and thus democratise public life, too.
Jurisprudence in Germany, for example, uses perhaps the "purest" tongue in terms of "Germanness", but also the most cumbersome, to be found today.
In the modern scientific German vocabulary data base in Leipzig (as of July 2003) there are nine million words and word groups in 35 million sentences (out of a corpus of 500 million words).
Writing system
Present
German is written using the Latin alphabet.
In addition to the 26 standard letters, German has three vowels with Umlaut, namely ä, ö and ü, as well as the Eszett or scharfes s (sharp s), ß.
Before the German spelling reform of 1996, ß replaced ss after long vowels and diphthongs and before consonants, word-, or partial-word-endings.
In reformed spelling, ß replaces ss only after long vowels and diphthongs.
Since there is no capital ß, it is always written as SS when capitalization is required.
For example, Maßband (tape measure) is capitalized MASSBAND.
An exception is the use of ß in legal documents and forms when capitalizing names.
To avoid confusion with similar names, a "ß" is to be used instead of "SS".
(So: "KREßLEIN" instead of "KRESSLEIN".)
A capital ß has been proposed and included in Unicode, but it is not yet recognized as standard German.
In Switzerland, ß is not used at all.
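The capitalisation behaviour described above happens to match Python's built-in Unicode case mapping, which makes for a quick illustration. The helper `upper_keep_sharp_s` is a hypothetical name for the legal-document exception; it is a sketch of the convention as stated here, not an official rule set.

```python
# Default Unicode upper-casing turns ß into SS, as in the Maßband example.
print("Maßband".upper())  # MASSBAND

# The capital sharp s that was added to Unicode is U+1E9E; it lower-cases to ß.
CAPITAL_SHARP_S = "\u1e9e"
print(CAPITAL_SHARP_S.lower() == "ß")  # True


def upper_keep_sharp_s(name):
    """Capitalise a name but leave ß untouched (legal-document convention)."""
    return "ß".join(part.upper() for part in name.split("ß"))


print(upper_keep_sharp_s("Kreßlein"))  # KREßLEIN
print("Kreßlein".upper())              # KRESSLEIN
```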
Umlaut vowels (ä, ö, ü) are commonly transcribed as ae, oe, and ue if the umlauts are not available on the keyboard used.
In the same manner ß can be transcribed as ss. German readers understand these transcriptions (although they look unusual), but they are avoided if the regular umlauts are available because they are considered a makeshift, not proper spelling.
(In Westphalia, city and family names exist where the extra e has a vowel lengthening effect, e.g. Raesfeld [ˈraːsfɛlt] and Coesfeld [ˈkoːsfɛlt], but this use of the letter e after a/o/u does not occur in the present-day spelling of words other than proper nouns.)
Unfortunately there is still no general agreement on exactly where these umlauts occur in the sorting sequence.
Telephone directories treat them by replacing them with the base vowel followed by an e, whereas dictionaries use just the base vowel.
As an example in a telephone book Ärzte occurs after Adressenverlage but before Anlagenbauer (because Ä is replaced by Ae).
In a dictionary Ärzte occurs after Arzt but before Asbest (because Ä is treated as A).
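The two sorting conventions can be expressed as two sort keys. This is a minimal sketch: the function names are made up for illustration, and folding ß to ss in both keys is an assumption, since the passage only describes the treatment of umlauts.

```python
# Telephone-directory convention: umlaut -> base vowel followed by e.
UMLAUT_TO_DIGRAPH = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}
# Dictionary convention: umlaut -> base vowel only.
UMLAUT_TO_BASE = {"ä": "a", "ö": "o", "ü": "u", "ß": "ss"}


def _fold(word, table):
    """Lower-case the word and replace each special letter via the table."""
    return "".join(table.get(ch, ch) for ch in word.lower())


def phone_book_key(word):
    return _fold(word, UMLAUT_TO_DIGRAPH)


def dictionary_key(word):
    return _fold(word, UMLAUT_TO_BASE)


words = ["Asbest", "Ärzte", "Anlagenbauer", "Arzt", "Adressenverlage"]

print(sorted(words, key=phone_book_key))
# ['Adressenverlage', 'Ärzte', 'Anlagenbauer', 'Arzt', 'Asbest']
print(sorted(words, key=dictionary_key))
# ['Adressenverlage', 'Anlagenbauer', 'Arzt', 'Ärzte', 'Asbest']
```

Python's default string comparison would place Ärzte after every unaccented word, which matches neither convention; an explicit key is needed either way.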
In some older dictionaries or indexes, initial Sch and St are treated as separate letters and are listed as separate entries after S.
Past
Until the early 20th century, German was mostly printed in blackletter typefaces (mostly in Fraktur, but also in Schwabacher) and written in corresponding handwriting (for example Kurrent and Sütterlin).
These variants of the Latin alphabet are very different from the serif or sans serif Antiqua typefaces used today, and particularly the handwritten forms are difficult for the untrained to read.
The printed forms, however, were claimed by some to be actually more readable when used for printing Germanic languages.
The Nazis initially promoted Fraktur and Schwabacher since they were considered Aryan, although they later abolished them in 1941 by claiming that these letters were Jewish.
The latter fact is not widely known anymore; today the letters are often associated with the Nazis and are no longer commonly used.
The Fraktur script remains present in everyday life through road signs, pub signs, beer brands and other forms of advertisement, where it is used to convey a certain rusticality and oldness.
A proper use of the long s, (langes s), ſ, is essential to write German text in Fraktur typefaces.
Many Antiqua typefaces also include the long s.
A specific set of rules applies to the use of the long s in German text, but it is now rarely used in Antiqua typesetting.
Any lower case "s" at the beginning of a syllable would be a long s, as opposed to a terminal s or short s (the more common variation of the letter s), which marks the end of a syllable; for example, in differentiating between the words Wachſtube (=guard-house) and Wachstube (=tube of floor polish).
One can easily decide which "s" to use by hyphenating the word appropriately ("Wach-ſtube" vs. "Wachs-tube").
The long s only appears in lower case.
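The hyphenation test can be sketched in a few lines. The function `apply_long_s` is a hypothetical helper that implements only what is stated here: a lower-case s at the start of a syllable becomes ſ. An s in the interior of a syllable is left unchanged because the passage does not cover that case, and the hyphenation itself is assumed to be supplied by the caller.

```python
LONG_S = "\u017f"  # ſ, LATIN SMALL LETTER LONG S


def apply_long_s(hyphenated):
    """Turn a syllable-initial lower-case s into a long s.

    `hyphenated` is a word with '-' marking the syllable boundaries,
    e.g. 'Wach-stube'. Upper-case S is never changed.
    """
    syllables = hyphenated.split("-")
    converted = [
        LONG_S + syllable[1:] if syllable.startswith("s") else syllable
        for syllable in syllables
    ]
    return "".join(converted)


print(apply_long_s("Wach-stube"))  # Wachſtube (guard-house)
print(apply_long_s("Wachs-tube"))  # Wachstube (tube of floor polish)
```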
The widespread ignorance of the correct use of the Fraktur scripts shows, however, in the many mistakes made, such as the frequent erroneous use of the round s instead of the long s at the beginning of a syllable, the failure to employ the mandatory ligatures of Fraktur, or the use of letter-forms more similar to Antiqua for certain especially hard-to-read Fraktur letters.
Phonology
Vowels
German vowels (excluding diphthongs; see below) come in short and long varieties, as detailed in the following table:
Short /ɛ/ is realised as [ɛ] in stressed syllables (including secondary stress), but as [ə] in unstressed syllables.
Note that stressed short /ɛ/ can be spelled either with e or with ä (hätte 'would have' and Kette 'chain', for instance, rhyme).
In general, the short vowels are open and the long vowels are closed.
The one exception is the open /ɛː/ sound of long Ä; in some varieties of standard German, /ɛː/ and /eː/ have merged into [eː], removing this anomaly.
In that case, pairs like Bären/Beeren 'bears/berries' or Ähre/Ehre 'spike/honour' become homophonous.
In many varieties of standard German, an unstressed /ɛr/ is not pronounced as [ər], but vocalised to [ɐ].
Whether any particular vowel letter represents the long or short phoneme is not completely predictable, although the following regularities exist:
If a vowel (other than i) is at the end of a syllable or followed by a single consonant, it is usually pronounced long (e.g. Hof [hoːf]).
If the vowel is followed by a double consonant (e.g. ff, ss or tt), ck, tz or a consonant cluster (e.g. st or nd), it is nearly always short (e.g. hoffen [ˈhɔfǝn]).
Double consonants are used only for this function of marking preceding vowels as short; the consonant itself is never pronounced lengthened or doubled.
Both of these rules have exceptions (e.g. hat [hat] 'has' is short despite the first rule; Kloster [ˈkloːstɐ], 'cloister'; Mond [moːnt], 'moon' are long despite the second rule).
For an i that is neither in the combination ie (making it long) nor followed by a double consonant or cluster (making it short), there is no general rule.
In some cases, there are regional differences: In central Germany (Hessen), the o in the proper name "Hoffmann" is pronounced long while most other Germans would pronounce it short; the same applies to the e in the geographical name "Mecklenburg" for people in that region.
The word Städte 'cities', is pronounced with a short vowel by some (Jan Hofer, ARD Television) and with a long vowel by others (Marietta Slomka, ZDF Television).
Finally, a vowel followed by ch can be short (Fach 'compartment', Küche 'kitchen') or long (Suche 'search', Bücher 'books') almost at random.
Thus, Lache is homographous: 'puddle' and 'manner of laughing' (coll.), 'laugh!' (Imp.).
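The spelling regularities above can be turned into a rough heuristic. The sketch below looks only at the first vowel letter of a word and at the consonant letters that follow it; the function name and the return labels are illustrative, diphthongs and digraphs other than ie are not handled, and the known exceptions (hat, Kloster, Mond) are deliberately predicted wrongly, since the rules themselves do not cover them.

```python
VOWELS = set("aeiouäöü")


def guess_first_vowel_length(word):
    """Guess 'long', 'short' or 'unpredictable' for the first vowel letter."""
    w = word.lower()
    start = next((i for i, ch in enumerate(w) if ch in VOWELS), None)
    if start is None:
        raise ValueError("word contains no vowel letter")
    if w[start:start + 2] == "ie":
        return "long"  # the combination ie marks a long vowel
    # Collect the consonant letters between this vowel and the next one.
    end = start + 1
    while end < len(w) and w[end] not in VOWELS:
        end += 1
    following = w[start + 1:end]
    if following.startswith("ch"):
        return "unpredictable"  # before ch the length varies
    if len(following) >= 2:
        return "short"  # double consonant, ck, tz or a cluster
    if w[start] == "i":
        return "unpredictable"  # no general rule for a lone i
    return "long"  # end of syllable or a single consonant


for example in ["Hof", "hoffen", "Fach", "Suche", "hat", "Mond"]:
    print(example, guess_first_vowel_length(example))
```

For hat the heuristic answers "long" and for Mond "short", which is exactly where the text notes that the rules fail.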
German vowels can form the following digraphs (in writing) and diphthongs (in pronunciation); note that the pronunciation of some of them (ei, äu, eu) is very different from what one would expect when considering the component letters:
Additionally, the digraph ie generally represents the phoneme /iː/, which is not a diphthong.
In many varieties, a /r/ at the end of a syllable is vocalised.
However, a sequence of a vowel followed by such a vocalised /r/ is not considered a diphthong: Bär 'bear', er 'he', wir 'we', Tor 'gate', kurz 'short', Wörter 'words'.
In most varieties of standard German, word stems that begin with a vowel are preceded by a glottal stop [ʔ].
Consonants
c standing by itself is not a German letter.
In borrowed words, it is usually pronounced [ʦ] (before ä, äu, e, i, ö, ü, y) or [k] (before a, o, u, or before consonants).
The combination ck is, as in English, used to indicate that the preceding vowel is short.
ch occurs most often and is pronounced either [ç] (after ä, ai, äu, e, ei, eu, i, ö, ü and after consonants) or [x] (after a, au, o, u).
Ch never occurs at the beginning of an originally German word.
In borrowed words with initial Ch there is no single agreement on the pronunciation.
For example, the word "Chemie" (chemistry) can be pronounced [keːˈmiː], [çeːˈmiː] or [ʃeːˈmiː] depending on dialect.
dsch is pronounced [ʤ] (like j in "jungle") but appears in only a few loanwords.
f is pronounced [f] as in "father".
h is pronounced [h] like in "home" at the beginning of a syllable.
After a vowel it is silent and only lengthens the vowel (e.g. "Reh" = roe deer).
j is pronounced [j] in Germanic words ("Jahr" [jaːɐ]).
In younger loanwords, it follows more or less the respective languages' pronunciations.
l is always pronounced [l], never [ɫ] (the English "Dark L").
q only exists in combination with u and appears both in Germanic and Latin words ("quer"; "Qualität").
It is pronounced [kv].
r is pronounced as a guttural sound (a uvular trill, [ʀ]) in front of a vowel or consonant ("Rasen" [ʀaːzən]; "Burg" like [buʀg]).
In spoken German, however, it is commonly vocalised after a vowel ("er" being pronounced rather like ['ɛɐ], "Burg" like [buɐg]).
In some southern non-standard varieties, the r is pronounced as a tongue-tip r (the alveolar trill).
s is pronounced [z] (as in "Zebra") in Germany if it forms the syllable onset (e.g. Sohn [zoːn]), otherwise [s] (e.g. Bus [bʊs]).
In Austria, it is always pronounced [s].
A ss [s] indicates that the preceding vowel is short.
st and sp at the beginning of words of German origin are pronounced [ʃt] and [ʃp], respectively.
ß (a letter unique to German called "Eszett") was a ligature of a double s and of a sz and is always pronounced [s].
Originating in Blackletter typeface, it traditionally replaced ss at the end of a syllable (e.g. "ich muss" → "ich muß"; "ich müsste" → "ich müßte"); within a word it contrasts with ss [s] in indicating that the preceding vowel is long (compare "in Maßen" [in 'maːsən] "with moderation" and "in Massen" [in 'masən] "in loads").
The use of ß has recently been limited by the latest German spelling reform, and the letter is no longer used for ss at the end of a syllable; Switzerland and Liechtenstein already abolished it in 1934.
sch is pronounced [ʃ] (like "sh" in "Shine").
v is pronounced [f] in words of Germanic origin (e.g. "Vater" [ˈfaːtɐ]) and [v] in most other words (e.g. "Vase" [ˈvaːzǝ]).
w is pronounced [v] like in "vacation" (e.g. "was" [vas]).
y only appears in loanwords and is traditionally considered a vowel.
z is always pronounced [ʦ] (e.g. "zog" [ʦoːk]).
A tz indicates that the preceding vowel is short.
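The rule given above for ch ([ç] after ä, ai, äu, e, ei, eu, i, ö, ü and after consonants; [x] after a, au, o, u) can be sketched as a lookup on the letters that precede it. The function name ch_sound is an assumption made for this illustration. The sketch looks only at the first ch in the word, returns None for word-initial ch (where the text says there is no single agreed pronunciation), and does not treat sch, which is pronounced [ʃ], as a separate case.

```python
BACK_VOWELS = set("aou")
# eu and äu end in the letter u but count as front contexts.
FRONT_DIGRAPHS = ("eu", "äu")


def ch_sound(word):
    """Return 'ç' or 'x' for the first ch in word, or None if it is initial.

    Hypothetical helper; it ignores sch and loanword pronunciations.
    """
    w = word.lower()
    pos = w.find("ch")
    if pos == -1:
        raise ValueError("word contains no 'ch'")
    if pos == 0:
        # Initial Ch occurs only in borrowed words; no single rule applies.
        return None

    before = w[:pos]
    if before.endswith(FRONT_DIGRAPHS):
        return "ç"
    if before[-1] in BACK_VOWELS:
        # Covers a, o, u and the digraph au.
        return "x"
    # Front vowels (including ai, ei) and consonants.
    return "ç"
```

With this sketch, Fach and Suche map to [x], while Küche, Bücher and durch map to [ç].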
Consonant shifts
German does not have any dental fricatives (as English th).
The th sounds, which the English language has inherited from Anglo Saxon, survived on the continent up to Old High German and then disappeared in German with the consonant shifts between the 8th and the 10th century.
It is sometimes possible to find parallels between English and German by replacing the English th with d in German: "Thank" → in German "Dank", "this" and "that" → "dies" and "das", "thou" (old 2nd person singular pronoun) → "du", "think" → "denken", "thirsty" → "durstig" and many other examples.
Likewise, the gh in Germanic English words, pronounced in several different ways in modern English (as an f, or not at all), can often be linked to German ch: "to laugh" → "lachen", "through" and "thorough" → "durch", "high" → "hoch", "naught" → "nichts", etc.
Cognates with English
There are many thousands of German words that are cognate to English words (in fact a sizeable fraction of native German and English vocabulary, although for various reasons much of it is not immediately obvious).
Most of the words in the following table have almost the same meaning as in English.
Compound word cognates
When these cognates have slightly different consonants, this is often due to the High German consonant shift.
Hence the affinity of English words with those of German dialects is more evident:
There are cognates whose meanings in either language have changed through the centuries.
It is sometimes difficult for both English and German speakers to discern the relationship.
On the other hand, once the definitions are made clear, then the logical relation becomes obvious.
Sometimes the generality or specificity of word pairs may be opposite in the two languages.
German and English also share many borrowings from other languages, especially Latin, French and Greek.
Most of these words have the same meaning, while a few have subtle differences in meaning.
As many of these words have been borrowed by numerous languages, not only German and English, they are called internationalisms in German linguistics.
For reference, a good number of these borrowed words are of the neuter gender.
Words borrowed by English
For a list of German loanwords in English, see Category:German loanwords.
In the English language, there are also many words taken from German without any letter change, e.g.:
Names for German in other languages
See also: Deutsch, Dutch, Deitsch, Dietsch, Teuton, Teutonic, Allemanic, Alleman, Theodisca
The names that countries have for the language differ from region to region.
In Italian the sole name for German is still tedesco, from the Latin theodiscus, meaning "vernacular".
A possible explanation for the use of words meaning "mute" (e.g., nemoj in Russian, němý in Czech, nem in Serbian) to refer to German (and also to Germans) in Slavic languages is that Germans were the first people Slavic tribes encountered with whom they could not communicate.
Romanian used to use the Slavonic term "nemţeşte", but "germană" is now widely used.
Hungarian "német" is also of Slavonic origin.
The Arabic name for Austria, النمسا ("an-namsa"), is derived from the Slavonic term.
Note also that though the Russian term for the language is немецкий (nemetskij), the country is Германия (Germania).
However, in certain other Slavic languages, such as Czech, the country name (Německo) is similar to the name of the language, německý (jazyk).
Finns and Estonians use the term saksa, originally from the Saxon tribe.
Scandinavians use derivatives of the word Tyskland/Þýskaland (from Theodisca) for the country and tysk(a)/þýska for the language.
Hebrew traditionally (nowadays this is not the case) used the Biblical term אַשְׁכֲּנָז (Ashkenaz) (Genesis 10:3) to refer to Germany, or to certain parts of it, and the Ashkenazi Jews are those who originate from Germany and Eastern Europe and formerly spoke Yiddish as their native language, derived from Middle High German.
Modern Hebrew uses גֶּרְמָנִי germaní (Or גֶּרְמָנִית germanít for the language).
The French term is allemand, the Spanish term is alemán, the Catalan term is alemany, and the Portuguese term is alemão; all derive from the ancient Alamanni tribal alliance, meaning literally "All Men".
The Latvian term vācu means "tinny" and refers disparagingly to the iron-clad Teutonic Knights that colonized the Baltic in the Middle Ages.
The Scottish Gaelic term for the German language, Gearmailtis, is formed in the standard way of adding -(a)is to the end of the country name.
See Names for Germany for further details on the origins of these and other terms.
GNU General Public License
The GNU General Public License (GNU GPL or simply GPL) is a widely used free software license, originally written by Richard Stallman for the GNU project.
The GPL is the most popular and well-known example of the type of strong copyleft license that requires derived works to be available under the same copyleft.
Under this philosophy, the GPL is said to grant the recipients of a computer program the rights of the free software definition and uses copyleft to ensure the freedoms are preserved, even when the work is changed or added to.
This is in distinction to permissive free software licenses, of which the BSD licenses are the standard examples.
The GNU Lesser General Public License (LGPL) is a modified, more permissive, version of the GPL, originally intended for some software libraries.
There is also a GNU Free Documentation License, which was originally intended for use with documentation for GNU software, but has also been adopted for other uses, such as the Wikipedia project.
The Affero General Public License (GNU AGPL) is a similar license with a focus on networking server software.
The GNU AGPL is similar to the GNU General Public License, except that it additionally covers the use of the software over a computer network, requiring that the complete source code be made available to any network user of the AGPLed work, for example a web application.
The Free Software Foundation recommends that this license be considered for any software that will commonly be run over the network.
History
The GPL was written by Richard Stallman in 1989 for use with programs released as part of the GNU project.
The original GPL was based on a unification of similar licenses used for early versions of GNU Emacs, the GNU Debugger and the GNU Compiler Collection.
These licenses contained similar provisions to the modern GPL, but were specific to each program, rendering them incompatible, despite being the same license.
Stallman's goal was to produce one license that could be used for any project, thus making it possible for many projects to share code.
An important vote of confidence in the GPL came from Linus Torvalds' adoption of the license for the Linux kernel in 1992, switching from an earlier license that prohibited commercial distribution.
As of August 2007, the GPL accounted for nearly 65% of the 43,442 free software projects listed on Freshmeat, and as of January 2006, about 68% of the projects listed on SourceForge.net.
Similarly, a 2001 survey of Red Hat Linux 7.1 found that 50% of the source code was licensed under the GPL, and a July 1997 survey of MetaLab, then the largest free software archive, showed that the GPL accounted for about half of the software packages with explicit license terms.
Prominent free software programs licensed under the GPL include the Linux kernel and the GNU Compiler Collection (GCC).
Some other free software programs are dual-licensed under multiple licenses, often with one of the licenses being the GPL.
Some observers believe that the strong copyleft provided by the GPL was crucial to the success of Linux, giving the programmers who contributed to it the confidence that their work would benefit the whole world and remain free, rather than being exploited by software companies that would not have to give anything back to the community.
The second version of the license, version 2, was released in 1991.
Over the following 15 years, some members of the FOSS (Free and Open Source Software) community came to believe that some software and hardware vendors were finding loopholes in the GPL, allowing GPL-licensed software to be exploited in ways that were contrary to the intentions of the programmers.
These concerns included tivoization (the inclusion of GPL-licensed software in hardware that will refuse to run modified versions of its software); the use of unpublished, modified versions of GPL software behind web interfaces; and patent deals between Microsoft and Linux and Unix distributors that may represent an attempt to use patents as a weapon against competition from Linux.
Version 3 was developed to attempt to address these concerns.
It was officially released on June 29, 2007.
Versions
Version 1
Version 1 of the GNU GPL, released in January 1989, prevented what were then the two main ways that software distributors restricted the freedoms that define free software.
The first problem was that distributors might publish binary files only, which are executable but not readable or modifiable by humans.
To prevent this, GPLv1 said that any vendor distributing binaries must also make the human-readable source code available under the same licensing terms.
The second problem was that distributors might add restrictions, either by adding them to the license or by combining the software with other software that had other restrictions on its distribution.
If this was done, the union of the two sets of restrictions would apply to the combined work, so unacceptable restrictions could be added.
To prevent this, GPLv1 said that modified versions, as a whole, had to be distributed under the terms in GPLv1.
Therefore, software distributed under the terms of GPLv1 could be combined with software under more permissive terms, as this would not change the terms under which the whole could be distributed, but software distributed under GPLv1 could not be combined with software distributed under a more restrictive license, as this would conflict with the requirement that the whole be distributable under the terms of GPLv1.
Version 2
According to Richard Stallman, the major change in GPLv2 was the "Liberty or Death" clause, as he calls it - Section 7.
This section says that if someone has restrictions imposed that prevent him or her from distributing GPL-covered software in a way that respects other users' freedom (for example, if a legal ruling states that he or she can only distribute the software in binary form), he or she cannot distribute it at all.
By 1990, it was becoming apparent that a less restrictive license would be strategically useful for some software libraries; when version 2 of the GPL (GPLv2) was released in June 1991, therefore, a second license, the Library General Public License (LGPL), was introduced at the same time and numbered with version 2 to show that both were complementary.
The version numbers diverged in 1999 when version 2.1 of the LGPL was released, which renamed it the GNU Lesser General Public License to reflect its place in the GNU philosophy.
Version 3
In late 2005, the Free Software Foundation (FSF) announced work on version 3 of the GPL (GPLv3).
On January 16, 2006, the first "discussion draft" of GPLv3 was published, and the public consultation began.
The public consultation was originally planned for nine to fifteen months but finally stretched to eighteen months with four drafts being published.
The official GPLv3 was released by FSF on June 29, 2007.
GPLv3 was written by Richard Stallman, with legal counsel from Eben Moglen and the Software Freedom Law Center.
According to Stallman, the most important changes are in relation to software patents, free software license compatibility, the definition of "source code", and hardware restrictions on software modification ("tivoization").
Other changes relate to internationalisation, how license violations are handled, and how additional permissions can be granted by the copyright holder.
Other notable changes include allowing authors to add certain additional conditions or requirements to their contributions.
One of those new optional requirements, sometimes referred to as the Affero clause, is intended to fulfill a request regarding software as a service; permitting the addition of this requirement makes GPLv3 compatible with the Affero General Public License.
The public consultation process was coordinated by the Free Software Foundation with assistance from Software Freedom Law Center, Free Software Foundation Europe, and other free software groups.
Comments were collected from the public via the gplv3.fsf.org web portal.
That portal runs purpose-written software called stet.
These comments were passed to four committees comprising approximately 130 people, including supporters and detractors of FSF's goals.
Those committees researched the comments submitted by the public and passed their summaries to Stallman for a decision on what the license would do.
During the public consultation process, 962 comments were submitted for the first draft.
By the end, a total of 2,636 comments had been submitted.
The third draft was released on March 28, 2007.
This draft included language intended to prevent patent cross-licenses like the controversial Microsoft-Novell patent agreement and restricted the anti-tivoization clauses to a legal definition of a "User" or "consumer product."
It also explicitly removed the section on "Geographical Limitations", whose probable removal had been announced at the launch of the public consultation.
The fourth discussion draft, which was the last, was released on May 31, 2007.
It introduced Apache Software License compatibility, clarified the role of outside contractors, and made an exception to permit the Microsoft-Novell agreement in section 11, paragraph 6.
This aims to make future such deals ineffective.
The license is also meant to cause Microsoft to extend the patent licenses it grants to Novell customers for the use of GPLv3 software to all users of that GPLv3 software; this is possible only if Microsoft is legally a "conveyor" of the GPLv3 software.
Others, notably some high-profile developers of the Linux kernel, commented to the mass media and made public statements about their objections to parts of discussion drafts 1 and 2.
Terms and conditions
The terms and conditions of the GPL are available to anybody receiving a copy of the work that has a GPL applied to it ("the licensee").
Any licensee who adheres to the terms and conditions is given permission to modify the work, as well as to copy and redistribute the work or any derivative version.
The licensee is allowed to charge a fee for this service, or do this free of charge.
This latter point distinguishes the GPL from software licenses that prohibit commercial redistribution.
The FSF argues that free software should not place restrictions on commercial use, and the GPL explicitly states that GPL works may be sold at any price.
The GPL additionally states that a distributor may not impose "further restrictions on the rights granted by the GPL".
This forbids activities such as distributing the software under a non-disclosure agreement or contract.
Distributors under the GPL also grant a license for any of their patents practiced by the software, to practice those patents in GPL software.
Section three of the license requires that programs distributed as pre-compiled binaries be accompanied by one of the following: a copy of the source code; a written offer to distribute the source code via the same mechanism as the pre-compiled binary; or the written offer to obtain the source code that the distributor received together with the pre-compiled binary under the GPL.
Copyleft
The distribution rights granted by the GPL for modified versions of the work are not unconditional.
When someone distributes a GPL'd work plus their own modifications, the requirements for distributing the whole work cannot be any greater than the requirements that are in the GPL.
This requirement is known as copyleft.
It earns its legal power from the use of copyright on software programs.
Because a GPL work is copyrighted, a licensee has no right to redistribute it, not even in modified form (barring fair use), except under the terms of the license.
One is only required to adhere to the terms of the GPL if one wishes to exercise rights normally restricted by copyright law, such as redistribution.
Conversely, if one distributes copies of the work without abiding by the terms of the GPL (for instance, by keeping the source code secret), he or she can be sued by the original author under copyright law.
Copyleft thus uses copyright law to accomplish the opposite of its usual purpose: instead of imposing restrictions, it grants rights to other people, in a way that ensures the rights cannot subsequently be taken away.
It also ensures that unlimited redistribution rights are not granted, should any legal flaw (or "bug") be found in the copyleft statement.
Many distributors of GPL'ed programs bundle the source code with the executables.
An alternative method of satisfying the copyleft is to provide a written offer to provide the source code on a physical medium (such as a CD) upon request.
In practice, many GPL'ed programs are distributed over the Internet, and the source code is made available over FTP.
For Internet distribution, this complies with the license.
Copyleft applies only when a person seeks to redistribute the program.
One is allowed to make private modified versions, without any obligation to divulge the modifications as long as the modified software is not distributed to anyone else.
Note that the copyleft applies only to the software and not to its output (unless that output is itself a derivative work of the program); for example, a public web portal running a modified derivative of a GPL'ed content management system is not required to distribute its changes to the underlying software.
Licensing and contractual issues
The GPL was designed as a license, rather than a contract.
In some Common Law jurisdictions, the legal distinction between a license and a contract is an important one: contracts are enforceable by contract law, whereas licenses are enforced under copyright law.
However, this distinction is not useful in the many jurisdictions where there are no differences between contracts and licenses, such as Civil Law systems.
Those who do not agree to the GPL's terms and conditions do not have permission, under copyright law, to copy or distribute GPL licensed software or derivative works.
They may, however, still use the software in any way they like.
Copyright holders
The text of the GPL is itself copyrighted, and the copyright is held by the Free Software Foundation (FSF).
However, the FSF does not hold the copyright for a work released under the GPL, unless an author explicitly assigns copyrights to the FSF (which seldom happens except for programs that are part of the GNU project).
Only the individual copyright holders have the authority to sue when a license violation takes place.
The FSF permits people to create new licenses based on the GPL, as long as the derived licenses do not use the GPL preamble without permission.
This is discouraged, however, since such a license is generally incompatible with the GPL.
(See the GPL FAQ for more information.)
Other licenses created by the GNU project include the GNU Lesser General Public License and the GNU Free Documentation License.
The GPL in court
A key dispute related to the GPL is whether or not non-GPL software can dynamically link to GPL libraries.
The GPL is clear in requiring that all derivative works of GPL'ed code must themselves be GPL'ed.
However, it is not clear whether an executable that dynamically links to GPL code should be considered a derivative work.
The free/open-source software community is split on this issue.
The FSF asserts that such an executable is indeed a derivative work if the executable and GPL code "make function calls to each other and share data structures," with others agreeing, while some (e.g. Linus Torvalds) agree that dynamic linking can create derived works but disagree over the circumstances.
On the other hand, some experts have argued that the question is still open: one Novell lawyer has written that dynamic linking not being derivative "makes sense" but is not "clear-cut," and Lawrence Rosen has claimed that a court of law would "probably" exclude dynamic linking from derivative works although "there are also good arguments" on the other side and "the outcome is not clear" (on a later occasion, he argued that "market-based" factors are more important than the linking technique).
This is ultimately a question not of the GPL per se, but of how copyright law defines derivative works.
In Galoob v. Nintendo the Ninth Circuit Court of Appeals defined a derivative work as having "'form' or permanence" and noted that "the infringing work must incorporate a portion of the copyrighted work in some form," but there have been no clear court decisions to resolve this particular conflict.
Since there is no record of anyone circumventing the GPL by dynamic linking and contesting when threatened with lawsuits by the copyright holder, the restriction appears de facto enforceable even if not yet proven de jure.
In 2002, MySQL AB sued Progress NuSphere for copyright and trademark infringement in United States district court.
NuSphere had allegedly violated MySQL's copyright by linking code for the Gemini table type into the MySQL server.
After a preliminary hearing before Judge Patti Saris on February 27, 2002, the parties entered settlement talks and eventually settled.
At the hearing, Judge Saris "saw no reason" that the GPL would not be enforceable.
In August 2003, the SCO Group stated that they believed the GPL to have no legal validity, and that they intended to take up lawsuits over sections of code supposedly copied from SCO Unix into the Linux kernel.
This was a problematic stand for them, as they had distributed Linux and other GPL'ed code in their Caldera OpenLinux distribution, and there is little evidence that they had any legal right to do so except under the terms of the GPL.
For more information, see SCO-Linux controversies and SCO v. IBM.
In April 2004 the netfilter/iptables project was granted a preliminary injunction against Sitecom Germany by Munich District Court after Sitecom refused to desist from distributing Netfilter's GPL'ed software in violation of the terms of the GPL.
In July 2004, the German court confirmed this injunction as a final ruling against Sitecom.
The court's justification for its decision exactly mirrored the predictions given earlier by the FSF's Eben Moglen:
Defendant has infringed on the copyright of plaintiff by offering the software 'netfilter/iptables' for download and by advertising its distribution, without adhering to the license conditions of the GPL.
Said actions would only be permissible if defendant had a license grant...
This is independent of the questions whether the licensing conditions of the GPL have been effectively agreed upon between plaintiff and defendant or not.
If the GPL were not agreed upon by the parties, defendant would notwithstanding lack the necessary rights to copy, distribute, and make the software 'netfilter/iptables' publicly available.
This ruling was important because it was the first time that a court had confirmed that violating terms of the GPL was an act of copyright violation.
However, the case was not as crucial a test for the GPL as some have concluded.
In the case, the enforceability of GPL itself was not under attack.
Instead, the court was merely attempting to discern if the license itself was in effect.
In May of 2005, Daniel Wallace filed suit against the Free Software Foundation (FSF) in the Southern District of Indiana, contending that the GPL is an illegal attempt to fix prices at zero.
The suit was dismissed in March 2006, on the grounds that Wallace had failed to state a valid anti-trust claim; the court noted that "the GPL encourages, rather than discourages, free competition and the distribution of computer operating systems, the benefits of which directly pass to consumers."
Wallace was denied the possibility of further amending his complaint, and was ordered to pay the FSF's legal expenses.
On September 8, 2005, the Seoul Central District Court ruled that the GPL had no legal relevance in a case dealing with trade secrets derived from GPL-licensed work.
The defendants argued that since it is impossible to maintain a trade secret while complying with the GPL and distributing the work, they were not in breach of trade secret.
The court considered this argument to be without ground.
On September 6, 2006, the gpl-violations.org project prevailed in court litigation against D-Link Germany GmbH regarding D-Link's inappropriate and copyright-infringing use of parts of the Linux operating system kernel.
The judgment finally provided the on-record legal precedent that the GPL is valid and legally binding, and that it will stand up in a German court.
In late 2007, the developers of BusyBox and the Software Freedom Law Center embarked upon a program to gain GPL compliance from distributors of BusyBox in embedded systems, suing those who would not comply.
These were claimed to be the first US uses of courts for enforcement of GPL obligations.
See the GPL lawsuits section of BusyBox.
Compatibility and multi-licensing
Many of the most common free software licenses, such as the original MIT/X license, the BSD license (in its current 3-clause form), and the LGPL, are "GPL-compatible".
That is, their code can be combined with a program under the GPL without conflict (the new combination would have the GPL applied to the whole).
However, some free/open source software licenses are not GPL-compatible.
Many GPL proponents have strongly advocated that free/open source software developers use only GPL-compatible licenses, because doing otherwise makes it difficult to reuse software in larger wholes.
Note that this issue only arises in concurrent use of licenses which impose conditions on their manner of combination.
Some licenses, such as the BSD license, impose no conditions on the manner of their combination.
Also see the list of FSF approved software licenses for examples of compatible and incompatible licenses.
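The combination rule described above can be illustrated with a minimal sketch. It is a toy model, not legal advice or an authoritative compatibility list: the function name combined_license and the license labels are assumptions made for this example, the set of compatible licenses contains only those named in this article, and any license not in that set is conservatively treated as incompatible.

```python
# Licenses this article names as GPL-compatible, plus the GPL itself.
# The label strings are illustrative, not official identifiers.
GPL_COMPATIBLE = {"GPL", "LGPL", "MIT/X", "BSD-3-Clause"}


def combined_license(component_licenses):
    """Return the license of a combined work that includes GPL code.

    Returns 'GPL' when every component is GPL-compatible, because the
    combination then has the GPL applied to the whole. Returns None when
    some component is not known to be compatible, meaning the combined
    work cannot be distributed under this model.
    """
    licenses = set(component_licenses)
    if "GPL" not in licenses:
        raise ValueError("this sketch only models combinations with GPL code")

    # Assumption: anything not listed is treated as incompatible.
    not_compatible = licenses - GPL_COMPATIBLE
    if not_compatible:
        return None
    return "GPL"
```

For example, combining GPL code with MIT/X code yields a whole under the GPL, while combining it with code under a license outside the set, such as the MPL in its original form, yields None.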
A number of businesses use dual-licensing to distribute a GPL version and sell a proprietary license to companies wishing to combine the package with proprietary code, using dynamic linking or not.
Examples of such companies include MySQL AB, Trolltech (Qt toolkit), Namesys (ReiserFS) and Red Hat (Cygwin).
Adoption
The Open Source License Resource Center maintained by Black Duck Software shows that GPL is the license used in about 70% of all open source software.
The vast majority of projects are released under GPL 2 with 3000 open source projects having migrated to GPL 3.
Criticism
In 2001 Microsoft CEO Steve Ballmer referred to Linux as "a cancer that attaches itself in an intellectual property sense to everything it touches."
Critics of Microsoft claim that the real reason Microsoft dislikes the GPL is that the GPL resists proprietary vendors' attempts to "embrace, extend and extinguish".
Microsoft has released Microsoft Windows Services for UNIX which contains GPL-licensed code.
In response to Microsoft's attacks on the GPL, several prominent Free Software developers and advocates released a joint statement supporting the license.
The GPL has been described as being "viral" by many of its critics because the GPL only allows conveyance of whole programs, which means that programmers are not allowed to convey programs that link to libraries having GPL-incompatible licenses.
The so-called "viral" effect of this is that under such circumstances disparately licensed software cannot be combined unless one of the licenses is changed.
Although theoretically either license could be changed, in the "viral" scenario the GPL cannot be practically changed (because the software may have so many contributors, some of whom will likely refuse), whereas the license of the other software can be practically changed.
This is part of a philosophical difference between the GPL and permissive free software licenses such as the BSD-style licenses, which do not put such a requirement on modified versions.
While proponents of the GPL believe that free software should ensure that its freedoms are preserved all the way from the developer to the user, others believe that intermediaries between the developer and the user should be free to redistribute the software as non-free software.
More specifically, the GPL requires that redistribution occur subject to the GPL, whereas more "permissive" licenses allow redistribution to occur under licenses more restrictive than the original license.
While the GPL does allow commercial distribution of GPL software, the market price will settle near the price of distribution—near zero—since the purchasers may redistribute the software and its source code for their cost of redistribution.
This could be seen to inhibit commercial use of GPL'ed code by others wishing to use that code for proprietary purposes—if they don't wish to avail themselves of GPL'ed code, they will have to re-implement it themselves.
Microsoft has included anti-GPL terms in their open source software.
In addition, the FreeBSD project has stated that "a less publicized and unintended use of the GPL is that it is very favorable to large companies that want to undercut software companies.
In other words, the GPL is well suited for use as a marketing weapon, potentially reducing overall economic benefit and contributing to monopolistic behavior".
It's not clear that there are any cases of this happening in practice, however.
The GPL has no indemnification clause explicitly protecting maintainers and developers from litigation resulting from unscrupulous contribution.
(If a developer submits existing patented or copyright work to a GPL project claiming it as their own contribution, all the project maintainers and even other developers can be held legally responsible for damages to the copyright or patent holder.)
Lack of indemnification is one criticism that led Mozilla to create the Mozilla Public License rather than use the GPL or LGPL.
However, Mozilla later relicensed their work under a GPL/LGPL/MPL triple license, due to problems with the GPL-incompatibility of the MPL.
Some software developers have found the extensive scope of the GPL to be too restrictive.
For example, Bjørn Reese and Daniel Stenberg describe how the downstream effects of the GPL on later developers create a "quodque pro quo" (Latin, "Everything in return for something").
For that reason, in 2001 they abandoned the GPLv2 in favor of less restrictive copyleft licenses.
A more specific example of the downstream effects of the GPL can be observed through the frame of incompatible licenses.
Sun Microsystems' ZFS, because it is licensed under the GPL-incompatible CDDL and covered by several Sun patents, cannot link to the GPL-licensed Linux kernel.
Some have also argued that the GPL could, and should, be shorter.
Google
Google Inc. is an American public corporation, earning revenue from advertising related to its Internet search, web-based e-mail, online mapping, office productivity, social networking, and video sharing services as well as selling advertising-free versions of the same technologies.
Google's headquarters, the Googleplex, is located in Mountain View, California.
As of June 30, 2008, the company has 19,604 full-time employees.
As of October 31, 2007, it is the largest American company (by market capitalization) that is not part of the Dow Jones Industrial Average.
Google was co-founded by Larry Page and Sergey Brin while they were students at Stanford University and the company was first incorporated as a privately held company on September 7, 1998.
Google's initial public offering took place on August 19, 2004, raising US$1.67 billion, making it worth US$23 billion.
Google has continued its growth through a series of new product developments, acquisitions, and partnerships.
Environmentalism, philanthropy, and positive employee relations have been important tenets during Google's growth; the last of these has led to Google being identified multiple times as Fortune Magazine's #1 Best Place to Work.
The company's unofficial slogan is "Don't be evil", although criticisms of Google include concerns regarding the privacy of personal information, copyright, censorship, and discontinuation of services.
History
Google began in January 1996 as a research project by Larry Page, who was soon joined by Sergey Brin; both were Ph.D. students at Stanford University in California.
They hypothesized that a search engine that analyzed the relationships between websites would produce better ranking of results than existing techniques, which ranked results according to the number of times the search term appeared on a page.
Their search engine was originally nicknamed "BackRub" because the system checked backlinks to estimate a site's importance.
A small search engine called Rankdex was already exploring a similar strategy.
Convinced that the pages with the most links to them from other highly relevant web pages must be the most relevant pages associated with the search, Page and Brin tested their thesis as part of their studies, and laid the foundation for their search engine.
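The idea that a page matters more when other important pages link to it can be illustrated with a small iterative calculation. The following is a minimal sketch of a simplified PageRank-style iteration, not Google's actual implementation; the function name rank_by_links, the damping factor of 0.85, and the five-page link graph are assumptions chosen only for illustration.

```python
def rank_by_links(links, damping=0.85, iterations=100):
    """Score pages by incoming links, weighting each link by its source's score.

    `links` maps a page to the list of pages it links to (hypothetical data).
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    score = {p: 1.0 / n for p in pages}  # start with equal importance
    for _ in range(iterations):
        # Every page keeps a small baseline share of importance.
        new_score = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its importance equally among its outgoing links.
                share = damping * score[page] / len(targets)
                for target in targets:
                    new_score[target] += share
            else:
                # A page without outgoing links spreads its importance evenly.
                share = damping * score[page] / n
                for other in pages:
                    new_score[other] += share
        score = new_score
    return score


# Hypothetical link graph: "c" is linked from three pages, "d" from two.
example_links = {"a": ["c"], "b": ["c"], "c": ["d"], "d": ["c"], "e": ["d"]}
scores = rank_by_links(example_links)
for page, value in sorted(scores.items(), key=lambda item: -item[1]):
    print(page, round(value, 3))
```

In this toy graph, page "c" ends up ranked first because it collects links from three pages, one of which ("d") is itself highly ranked. A ranking based only on counting search terms on a page would not capture this.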
Originally, the search engine used the Stanford University website with the domain google.stanford.edu.
The domain google.com was registered on September 15, 1997, and the company was incorporated as Google Inc. on September 7, 1998, at a friend's garage in Menlo Park, California.
The total initial investment raised for the new company amounted to almost US$1.1 million, including a US$100,000 check by Andy Bechtolsheim, one of the founders of Sun Microsystems.
In March 1999, the company moved into offices in Palo Alto, home to several other noted Silicon Valley technology startups.
After quickly outgrowing two other sites, the company leased a complex of buildings in Mountain View at 1600 Amphitheatre Parkway from Silicon Graphics (SGI) in 2003.
The company has remained at this location ever since, and the complex has since come to be known as the Googleplex (a play on the word googolplex).
In 2006, Google bought the property from SGI for US$319 million.
The Google search engine attracted a loyal following among the growing number of Internet users, who liked its simple design and usability.
In 2000, Google began selling advertisements associated with search keywords.
The ads were text-based to maintain an uncluttered page design and to maximize page loading speed.
Keywords were sold based on a combination of price bid and clickthroughs, with bidding starting at US$0.05 per click.
This model of selling keyword advertising was pioneered by Goto.com (later renamed Overture Services, before being acquired by Yahoo! and rebranded as Yahoo! Search Marketing).
While many of its dot-com rivals failed in the new Internet marketplace, Google quietly rose in stature while generating revenue.
The name "Google" originated from a common misspelling of the word "googol", which refers to 10^100, the number represented by a 1 followed by one hundred zeros.
Having found its way increasingly into everyday language, the verb "google" was added to the Merriam Webster Collegiate Dictionary and the Oxford English Dictionary in 2006, meaning "to use the Google search engine to obtain information on the Internet."
A patent describing part of Google's ranking mechanism (PageRank) was granted on September 4, 2001.
The patent was officially assigned to Stanford University and lists Lawrence Page as the inventor.
Financing and initial public offering
The first funding for Google as a company was secured in 1998, in the form of a US$100,000 contribution from Andy Bechtolsheim, co-founder of Sun Microsystems, given to a corporation which did not yet exist.
Around six months later, a much larger round of funding was announced, with the major investors being rival venture capital firms Kleiner Perkins Caufield & Byers and Sequoia Capital.
Google's IPO took place on August 19, 2004.
19,605,052 shares were offered at a price of US$85 per share.
Of that, 14,142,135 (another mathematical reference as √2 ≈ 1.4142135) were floated by Google, and the remaining 5,462,917 were offered by existing stockholders.
The sale of US$1.67 billion gave Google a market capitalization of more than US$23 billion.
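The figures quoted above can be checked with simple arithmetic. The short sketch below uses only the numbers stated in the text (the share counts, the US$85 offer price, and the 271 million total shares); the variable names are illustrative.

```python
# Share counts and offer price as quoted in the text.
shares_floated_by_google = 14_142_135
shares_from_stockholders = 5_462_917
offer_price_usd = 85
total_shares_outstanding = 271_000_000  # "271 million shares", rounded in the text

shares_offered = shares_floated_by_google + shares_from_stockholders
proceeds_usd = shares_offered * offer_price_usd
implied_market_cap_usd = total_shares_outstanding * offer_price_usd

print(shares_offered)          # 19605052
print(proceeds_usd)            # 1666429420, about US$1.67 billion
print(implied_market_cap_usd)  # 23035000000, just over US$23 billion
```

The two share blocks add up to the 19,605,052 shares offered, and the offer price applied to all 271 million shares gives the market capitalization of slightly more than US$23 billion.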
The vast majority of Google's 271 million shares remained under Google's control.
Many of Google's employees became instant paper millionaires.
Yahoo!, a competitor of Google, also benefited from the IPO because it owned 8.4 million shares of Google as of August 9, 2004, ten days before the IPO.
Google's stock has performed well since its IPO, with shares hitting US$700 for the first time on October 31, 2007, due to strong sales and earnings in the advertising market, as well as the release of new features such as the desktop search function and its iGoogle personalized home page.
The surge in stock price is fueled primarily by individual investors, as opposed to large institutional investors and mutual funds.
The company is listed on the NASDAQ stock exchange under the ticker symbol GOOG and on the London Stock Exchange under the ticker symbol GGEA.
Growth
While the company's primary business interest is in the web content arena, Google has begun experimenting with other markets, such as radio and print publications.
On January 17, 2006, Google announced its purchase of dMarc, a radio advertising company that provides an automated system allowing companies to advertise on the radio.
This will allow Google to combine two niche advertising media—the Internet and radio—with Google's ability to laser-focus on the tastes of consumers.
Google has also begun an experiment in selling advertisements from its advertisers in offline newspapers and magazines, with select advertisements in the Chicago Sun-Times.
They have been filling unsold space in the newspaper that would have normally been used for in-house advertisements.
Google was added to the S&P 500 index on March 30, 2006.
It replaced Burlington Resources, a major oil producer based in Houston which was acquired by ConocoPhillips.
Acquisitions
Since 2001, Google has acquired several small start-up companies, often consisting of innovative teams and products.
One of the earlier companies that Google bought was Pyra Labs.
They were the creators of Blogger, a weblog publishing platform, first launched in 1999.
This acquisition led to many premium features becoming free.
Pyra Labs was originally formed by Evan Williams, who left Google in 2004.
In early 2006, Google acquired Upstartle, a company responsible for the online word processor, Writely.
The technology in this product was used by Google to eventually create Google Docs & Spreadsheets.
In 2004, Google acquired a company called Keyhole, Inc., which developed a product called Earth Viewer which was renamed in 2005 to Google Earth.
In February 2006, software company Adaptive Path sold Measure Map, a weblog statistics application, to Google.
Registration to the service has since been temporarily disabled.
The last update regarding the future of Measure Map was made on April 6, 2006 and outlined many of the service's known issues.
In late 2006, Google bought online video site YouTube for US$1.65 billion in stock.
Shortly after, on October 31, 2006, Google announced that it had also acquired JotSpot, a developer of wiki technology for collaborative Web sites.
On April 13, 2007, Google reached an agreement to acquire DoubleClick.
Google agreed to buy the company for US$3.1 billion.
On July 9, 2007, Google announced that it had signed a definitive agreement to acquire enterprise messaging security and compliance company Postini.
Partnerships
In 2005, Google entered into partnerships with other companies and government agencies to improve production and services.
Google announced a partnership with NASA Ames Research Center to build offices and work on research projects involving large-scale data management, nanotechnology, distributed computing, and the entrepreneurial space industry.
Google also entered into a partnership with Sun Microsystems in October to help share and distribute each other's technologies.
The company entered into a partnership with Time Warner's AOL, to enhance each other's video search services.
The same year, the company became a major financial investor of the new .mobi top-level domain for mobile devices, in conjunction with several other companies, including Microsoft, Nokia, and Ericsson among others.
In September 2007, Google launched "AdSense for Mobile", a service for its publishing partners that provides the ability to monetize their mobile websites through the targeted placement of mobile text ads, and acquired the mobile social networking site Zingku.mobi to "provide people worldwide with direct access to Google applications, and ultimately the information they want and need, right from their mobile devices."
In 2006, Google and News Corp.'s Fox Interactive Media entered into a US$900 million agreement to provide search and advertising on the popular social networking site, MySpace.
On November 5, 2007 Google announced the Open Handset Alliance to develop an open platform for mobile services called Android.
In March 2008, Google, Sprint, Intel, Comcast, Time Warner Cable, Bright House Networks, and Clearwire together founded Xohm to provide wireless telecommunication service.
Products and services
Google has created services and tools for the general public and business environment alike, including Web applications, advertising networks and solutions for businesses.
Advertising
Most of Google's revenue is derived from advertising programs.
For the 2006 fiscal year, the company reported US$10.492 billion in total advertising revenues and only US$112 million in licensing and other revenues.
Google AdWords allows Web advertisers to display advertisements in Google's search results and the Google Content Network, through either a cost-per-click or cost-per-view scheme.
Through Google AdSense, website owners can also display adverts on their own sites and earn money every time the ads are clicked.
Web-based software
The Google web search engine is the company's most popular service.
As of August 2007, Google is the most used search engine on the web with a 53.6% market share, ahead of Yahoo! (19.9%) and Live Search (12.9%).
Google indexes billions of Web pages, so that users can search for the information they desire, through the use of keywords and operators.
Google has also applied its Web Search technology to other search services, including Image Search, Google News, the price comparison site Google Product Search, the interactive Usenet archive Google Groups, Google Maps, and more.
In 2004, Google launched its own free web-based e-mail service, known as Gmail (or Google Mail in some jurisdictions).
Gmail features spam-filtering technology and the capability to use Google technology to search e-mail.
The service generates revenue by displaying advertisements and links from the AdWords service that are tailored to the choice of the user and/or content of the e-mail messages displayed on screen.
In early 2006, the company launched Google Video, which not only allows users to search and view freely available videos but also offers users and media publishers the ability to publish their content, including television shows on CBS, NBA basketball games, and music videos.
In August 2007, Google announced that it would shut down its video rental and sale program and offer refunds and Google Checkout credits to consumers who had purchased videos to own.
On February 28, 2008 Google launched the Google Sites wiki as a Google Apps component.
Google has also developed several desktop applications, including Google Earth, an interactive mapping program powered by satellite and aerial imagery that covers the vast majority of the planet.
Google Earth is generally considered to be remarkably accurate and extremely detailed.
Many major cities have such detailed images that one can zoom in close enough to see vehicles and pedestrians clearly.
Consequently, there have been some concerns about national security implications.
Specifically, some countries and militaries contend the software can be used to pinpoint with near-precision accuracy the physical location of critical infrastructure, commercial and residential buildings, bases, government agencies, and so on.
However, the satellite images are not necessarily frequently updated, and all of them are available at no charge through other products and even government sources, such as NASA and the National Geospatial-Intelligence Agency.
Some counter this argument by stating that Google Earth makes it easier to access and research the images.
Many other products are available through Google Labs, which is a collection of incomplete applications that are still being tested for use by the general public.
Google has promoted their products in various ways.
In London, Google Space was set up in Heathrow Airport, showcasing several products, including Gmail, Google Earth and Picasa.
Also, a similar page was launched for American college students, under the name College Life, Powered by Google.
In 2007, some reports surfaced that Google was planning the release of its own mobile phone, possibly a competitor to Apple's iPhone.
The project, called Android, provides a standard development kit that will allow any "Android" phone to run software developed for the Android SDK, no matter the phone manufacturer.
In October 2007, Google SMS service was launched in India allowing users to get business listings, movie showtimes, and information by sending an SMS.
Enterprise products
In 2007, Google launched Google Apps Premier Edition, a version of Google Apps targeted primarily at the business user.
It includes such extras as more disk space for e-mail, API access, and premium support, for a price of US$50 per user per year.
A large implementation of Google Apps with 38,000 users is at Lakehead University in Thunder Bay, Ontario, Canada.
Platform
Google runs its services on several server farms, each comprising thousands of low-cost commodity computers running stripped-down versions of Linux.
While the company divulges no details of its hardware, a 2006 estimate cites 450,000 servers, "racked up in clusters at data centers around the world."
Corporate affairs and culture
Google is known for its relaxed corporate culture, of which its playful variations on its own corporate logo are an indicator.
In 2007 and 2008, Fortune Magazine placed Google at the top of its list of the hundred best places to work.
Google's corporate philosophy embodies such casual principles as "you can make money without doing evil," "you can be serious without a suit," and "work should be challenging and the challenge should be fun."
Google has been criticized for having salaries below industry standards.
For example, some system administrators earn no more than US$35,000 per year – considered to be quite low for the Bay Area job market.
However, Google's stock performance following its IPO has enabled many early employees to be competitively compensated by participation in the corporation's remarkable equity growth.
Google implemented other employee incentives in 2005, such as the Google Founders' Award, in addition to offering higher salaries to new employees.
Google's workplace amenities, culture, global popularity, and strong brand recognition have also attracted potential applicants.
After the company's IPO in August 2004, it was reported that founders Sergey Brin and Larry Page, and CEO Eric Schmidt, requested that their base salary be cut to US$1.00.
Subsequent offers by the company to increase their salaries have been turned down, primarily because, "their primary compensation continues to come from returns on their ownership stakes in Google.
As significant stockholders, their personal wealth is tied directly to sustained stock price appreciation and performance, which provides direct alignment with stockholder interests."
Prior to 2004, Schmidt was making US$250,000 per year, and Page and Brin each earned a salary of US$150,000.
They have all declined recent offers of bonuses and increases in compensation by Google's board of directors.
In a 2007 report of the United States' richest people, Forbes reported that Sergey Brin and Larry Page were tied for #5 with a net worth of US$18.5 billion each.
In 2007 and through early 2008, Google has seen the departure of several top executives.
Justin Rosenstein, Google’s product manager, left in June of 2007.
Shortly thereafter, Gideon Yu, former chief financial officer of YouTube, a Google unit, joined Facebook along with Benjamin Ling, a high-ranking engineer, who left in October 2007.
In March 2008, two senior Google leaders announced their desire to pursue other opportunities.
Sheryl Sandberg, ex-VP of global online sales and operations, began her position as COO of Facebook, while Ash ElDifrawi, former head of brand advertising, left to become CMO of Netshops Inc.
Googleplex
Google's headquarters in Mountain View, California, is referred to as "the Googleplex" in a play on words: a googolplex is 1 followed by a googol of zeros, and the HQ is a complex of buildings (cf. multiplex, cineplex, etc).
The lobby is decorated with a piano, lava lamps, old server clusters, and a projection of search queries on the wall.
The hallways are full of exercise balls and bicycles.
Each employee has access to the corporate recreation center.
Recreational amenities are scattered throughout the campus and include a workout room with weights and rowing machines, locker rooms, washers and dryers, a massage room, assorted video games, Foosball, a baby grand piano, a pool table, and ping pong.
In addition to the rec room, there are snack rooms stocked with various foods and drinks.
In 2006, Google moved into office space in New York City, at 111 Eighth Ave. in Manhattan.
The office was specially designed and built for Google and houses its largest advertising sales team, which has been instrumental in securing large partnerships, most recently deals with MySpace and AOL.
In 2003, they added an engineering staff in New York City, which has been responsible for more than 100 engineering projects, including Google Maps, Google Spreadsheets, and others.
It is estimated that the building costs Google US$10 million per year to rent and is similar in design and functionality to its Mountain View headquarters, including foosball, air hockey, and ping-pong tables, as well as a video game area.
In November 2006, Google opened offices on Carnegie Mellon's campus in Pittsburgh.
By late 2006, Google also established a new headquarters for its AdWords division in Ann Arbor, Michigan.
The size of Google's search system is presently undisclosed.
The best estimates place the total number of the company's servers at 450,000, spread over twenty-five locations throughout the world, including major operations centers in Dublin (European Operations Headquarters) and Atlanta, Georgia.
Google is also in the process of constructing a major operations center in The Dalles, Oregon, on the banks of the Columbia River.
The site, also referred to by the media as Project 02, was chosen due to the availability of inexpensive hydroelectric power and a large surplus of fiber optic cable, remnants of the dot com boom of the late 1990s.
The computing center is estimated to be the size of two football fields, and it has created hundreds of construction jobs, causing local real estate prices to increase 40%.
Upon completion, the center is expected to create 60 to 200 permanent jobs in the town of 12,000 people.
Google is taking steps to ensure that their operations are environmentally sound.
In October 2006, the company announced plans to install thousands of solar panels to provide up to 1.6 megawatts of electricity, enough to satisfy approximately 30% of the campus' energy needs.
The system will be the largest solar power system constructed on a U.S. corporate campus and one of the largest on any corporate site in the world.
In June 2007, Google announced that they plan to become carbon neutral by 2008, which includes investing in energy efficiency, renewable energy sources, and purchasing carbon offsets, such as investing in projects like capturing and burning methane from animal waste at Mexican and Brazilian farms.
Innovation time off
As an interesting motivation technique (usually called Innovation Time Off), all Google engineers are encouraged to spend 20% of their work time (one day per week) on projects that interest them.
Some of Google's newer services, such as Gmail, Google News, Orkut, and AdSense originated from these independent endeavors.
In a talk at Stanford University, Marissa Mayer, Google's Vice President of Search Products and User Experience, stated that her analysis showed that half of the new product launches originated from the 20% time.
Easter eggs and April Fool's Day jokes
Google has a tradition of creating April Fool's Day jokes—such as Google MentalPlex, which allegedly featured the use of mental power to search the web.
In 2002, they claimed that pigeons were the secret behind their growing search engine.
In 2004, they featured Google Lunar (which claimed to feature jobs on the moon), and in 2005, a fictitious brain-boosting drink, termed Google Gulp, was announced.
In 2006, they came up with Google Romance, a hypothetical online dating service.
In 2007, Google announced two joke products.
The first was a free wireless Internet service called TiSP (Toilet Internet Service Provider) in which one obtained a connection by flushing one end of a fiber-optic cable down their toilet and waiting only an hour for a "Plumbing Hardware Dispatcher (PHD)" to connect it to the Internet.
Additionally, Google's Gmail page displayed an announcement for Gmail Paper, which allows users of their free email service to have email messages printed and shipped to a snail mail address.
Google's services contain a number of Easter eggs; for instance, the Language Tools page offers the search interface in the Swedish Chef's "Bork bork bork," Pig Latin, "Hacker" (actually leetspeak), Elmer Fudd, and Klingon.
In addition, the search engine calculator provides the Answer to Life, the Universe, and Everything from Douglas Adams' The Hitchhiker's Guide to the Galaxy.
As Google's search box can be used as a unit converter (as well as a calculator), some non-standard units are built in, such as the Smoot.
Google also routinely modifies its logo in accordance with various holidays or special events throughout the year, such as Christmas, Mother's Day, or the birthdays of various notable individuals.
IPO and culture
Many people speculated that Google's IPO would inevitably lead to changes in the company's culture, because of shareholder pressure for employee benefit reductions and short-term advances, or because a large number of the company's employees would suddenly become millionaires on paper.
In a report given to potential investors, co-founders Sergey Brin and Larry Page promised that the IPO would not change the company's culture.
Later Mr. Page said, "We think a lot about how to maintain our culture and the fun elements.
We spent a lot of time getting our offices right.
We think it's important to have a high density of people.
People are packed together everywhere.
We all share offices.
We like this set of buildings because it's more like a densely packed university campus than a typical suburban office park."
However, many analysts are finding that as Google grows, the company is becoming more "corporate".
In 2005, articles in The New York Times and other sources began suggesting that Google had lost its anti-corporate, no evil philosophy.
In an effort to maintain the company's unique culture, Google designated a Chief Culture Officer in 2006, who also serves as the Director of Human Resources.
The purpose of the Chief Culture Officer is to develop and maintain the culture and work on ways to keep true to the core values that the company was founded on in the beginning—a flat organization, a lack of hierarchy, a collaborative environment.
Philanthropy
In 2004, Google formed a for-profit philanthropic wing, Google.org, with a start-up fund of US$1 billion.
The express mission of the organization is to create awareness about climate change, global public health, and global poverty.
One of its first projects is to develop a viable plug-in hybrid electric vehicle that can attain 100 mpg.
The founding and current director is Dr. Larry Brilliant.
Criticism
As it has grown, Google has found itself the focus of several controversies related to its business practices and services.
For example, Google Book Search's effort to digitize millions of books and make the full text searchable has led to copyright disputes with the Authors Guild.
Google's cooperation with the governments of China, and to a lesser extent France and Germany (regarding Holocaust denial), to filter search results in accordance with regional laws and regulations has led to claims of censorship.
Google's persistent cookie and other information collection practices have led to concerns over user privacy.
As of December 11, 2007, Google, like the Microsoft search engine, stores "personal information for 18 months" and by comparison, Yahoo! and AOL (Time Warner) "retain search requests for 13 months."
A number of Indian state governments have raised concerns about the security risks posed by geographic details provided by Google Earth's satellite imaging.
Google has also been criticized by advertisers regarding its inability to combat click fraud, when a person or automated script is used to generate a charge on an advertisement without really having an interest in the product.
Industry reports in 2006 claim that approximately 14 to 20 percent of clicks were in fact fraudulent or invalid.
Further, Google has faced allegations of sexism and ageism from former employees.
Google has also faced accusations in Harper's Magazine of excessive energy usage, and of employing its "Don't be evil" motto and its very public energy-saving campaigns as a means of covering up or making up for the massive amounts of energy its servers actually require.
Also, on July 1, 2008, US District Court Judge Louis Stanton ordered Google to give YouTube user data and logs to Viacom to support its case in a billion-dollar copyright lawsuit against Google.
On July 14, 2008, however, Google and Viacom agreed to a compromise to protect YouTube users' personal data in the $1 billion (£497 million) copyright lawsuit.
Google agreed it will make user information and internet protocol addresses from its YouTube subsidiary anonymous before handing over the data to Viacom.
The privacy deal also applied to other litigants including the FA Premier League, the Rodgers & Hammerstein Organisation and the Scottish Premier League.
The deal, however, did not extend the anonymity to employees, since Viacom wants to prove that Google staff were aware of the uploading of illegal material to the site.
The parties therefore will further meet on the matter lest the data be made available to the court.
Google Translate
Google Translate is a service provided by Google Inc. to translate a section of text, or a webpage, into another language, with limits to the number of paragraphs, or range of technical terms, translated.
For some languages, users are asked for alternate translations, such as for technical terms, to be included for future updates to the translation process.
Unlike other translation services such as Babel Fish, AOL, and Yahoo which use SYSTRAN, Google uses its own translation software.
Functions
The service also includes translation of an entire Web page.
The translation is limited in the number of paragraphs per webpage (such as indicated by break-tags); however, if text on a webpage is separated by horizontal blank-line images (auto-wrapped without using any break-tags), a long webpage containing several thousand words can be translated.
Google Translate, like other automatic translation tools, has its limitations.
While it can help the reader to understand the general content of a foreign language text, it does not deliver accurate translations and does not produce publication-standard content. For example, it often translates words out of context and deliberately does not apply any grammatical rules.
Approach
Google Translate is based on an approach called statistical machine translation, and more specifically on research by Franz-Josef Och, who won the DARPA contest for speed machine translation in 2003.
Och is now the head of Google's machine translation department.
According to Och, a solid base for developing a usable statistical machine translation system for a new pair of languages from scratch consists of a bilingual text corpus (or parallel collection) of more than a million words and two monolingual corpora of more than a billion words each.
Statistical models built from this data are then used to translate between those languages.
To acquire this huge amount of linguistic data, Google used United Nations documents.
The same document is normally available in all six official UN languages, so Google now has a six-language corpus of 20 billion words' worth of human translations.
The availability of Arabic and Chinese as official UN languages is probably one of the reasons why Google Translate initially focused on the development of translation between English and those languages, and not, for example, Japanese and German, which are not official languages at the UN.
Google representatives have been very active at domestic conferences in Japan in the field asking researchers to provide them with bilingual corpora.
Options
(in chronological order)
Beginning
English to Arabic
English to French
English to German
English to Spanish
French to English
German to English
Spanish to English
Arabic to English
2nd stage
English to Portuguese
Portuguese to English
3rd stage
English to Italian
Italian to English
4th stage
English to Chinese (Simplified) BETA
English to Japanese BETA
English to Korean BETA
Chinese (Simplified) to English BETA
Japanese to English BETA
Korean to English BETA
5th stage
English to Russian BETA
Russian to English BETA
6th stage
English to Arabic BETA
Arabic to English BETA
7th stage (launched February, 2007)
English to Chinese (Traditional) BETA
Chinese (Traditional) to English BETA
Chinese (Simplified to Traditional) BETA
Chinese (Traditional to Simplified) BETA
8th stage (launched October, 2007)
all 25 language pairs use Google's machine translation system
9th stage
English to Hindi BETA
Hindi to English BETA
10th stage (as of this stage, translation can be done between any two languages)
Bulgarian
Croatian
Czech
Danish
Dutch
Finnish
Greek
Norwegian
Polish
Romanian
Swedish
Grammar
Grammar is the field of linguistics that covers the rules governing the use of any given natural language.
It includes morphology and syntax, often complemented by phonetics, phonology, semantics, and pragmatics.
Each language has its own distinct grammar.
"English grammar" is the rules of the English language itself.
"An English grammar" is a specific study or analysis of these rules.
A reference book describing the grammar of a language is called a "reference grammar" or simply "a grammar".
A fully explicit grammar exhaustively describing the grammatical constructions of a language is called a descriptive grammar, as opposed to linguistic prescription, which tries to enforce rules governing how a language is to be used.
Grammatical frameworks are approaches to constructing grammars.
The standard framework of generative grammar is the transformational grammar model developed by Noam Chomsky and his followers from the 1950s to 1980s.
Etymology
The word "grammar" derives from Greek γραμματική τέχνη (grammatike techne), which means "art of letters," from γράμμα (gramma), "letter," and that from γράφειν (graphein), "to draw, to write".
History
The first systematic grammars originate in Iron Age India, with Panini (4th c. BC) and his commentators Pingala (ca. 200 BC), Katyayana, and Patanjali (2nd c. BC).
In the West, grammar emerges as a discipline in Hellenism from the 3rd c. BC forward with authors like Rhianus and Aristarchus of Samothrace, the oldest extant work being the Art of Grammar, attributed to Dionysius Thrax (ca. 100 BC).
Latin grammar developed by following Greek models from the 1st century BC, due to the work of authors such as Orbilius Pupillus, Remmius Palaemon, Marcus Valerius Probus, Verrius Flaccus, Aemilius Asper.
Tamil grammatical tradition also began around the 1st century BC with the Tolkāppiyam.
A grammar of Irish originated in the 7th century with the Auraicept na n-Éces.
Arabic grammar emerges from the 8th century with the work of Ibn Abi Ishaq and his students.
The first treatises on Hebrew grammar appear in the High Middle Ages, in the context of Midrash (exegesis of the Hebrew Bible).
The Karaite tradition originates in Abbasid Baghdad.
The Diqduq (10th century) is one of the earliest grammatical commentaries on the Hebrew Bible.
Ibn Barun in the 12th century compares the Hebrew language with Arabic in the Islamic grammatical tradition.
Belonging to the trivium of the seven liberal arts, grammar was taught as a core discipline throughout the Middle Ages, following the influence of authors from Late Antiquity, such as Priscian.
Treatment of vernaculars begins gradually during the High Middle Ages, with isolated works such as the First Grammatical Treatise, but becomes influential only in the Renaissance and Baroque periods.
In 1486, Antonio de Nebrija published Las introduciones Latinas contrapuesto el romance al Latin, and the first Spanish grammar, Gramática de la lengua castellana, in 1492.
During the 16th-century Italian Renaissance, the Questione della lingua was the discussion on the status and ideal form of the Italian language, initiated by Dante's De vulgari eloquentia (Pietro Bembo, Prose della volgar lingua, Venice 1525).
Grammars of non-European languages began to be compiled for the purposes of evangelization and Bible translation from the 16th century onward, such as Grammatica o Arte de la Lengua General de los Indios de los Reynos del Perú (1560), a Quechua grammar by Fray Domingo de Santo Tomás.
In 1643 there appeared Ivan Uzhevych's Grammatica sclavonica and, in 1762, the Short Introduction to English Grammar of Robert Lowth was also published.
The Grammatisch-Kritisches Wörterbuch der hochdeutschen Mundart, a High German grammar in five volumes by Johann Christoph Adelung, appeared as early as 1774.
From the latter part of the 18th century, grammar came to be understood as a subfield of the emerging discipline of modern linguistics.
The Serbian grammar by Vuk Stefanović Karadžić arrived in 1814, while the Deutsche Grammatik of the Brothers Grimm was first published in 1818.
The Comparative Grammar of Franz Bopp, the starting point of modern comparative linguistics, came out in 1833.
In the USA, the Society for the Promotion of Good Grammar has designated March 4, 2008 as National Grammar Day.
Development of grammars
Grammars evolve through usage, and grammars also develop due to separations of the human population.
With the advent of written representations, formal rules about language usage tend to appear also.
Formal grammars are codifications of usage that are developed by repeated documentation over time, and by observation as well.
As the rules become established and developed, the prescriptive concept of grammatical correctness can arise.
This often creates a discrepancy between contemporary usage and that which has been accepted over time as being correct.
Linguists tend to believe that prescriptive grammars do not have any justification beyond their authors' aesthetic tastes; however, prescriptions are considered in sociolinguistics as part of the explanation for why some people say "I didn't do nothing", some say "I didn't do anything", and some say one or the other depending on social context.
The formal study of grammar is an important part of education for children from a young age through advanced learning, though the rules taught in schools are not a "grammar" in the sense most linguists use the term, as they are often prescriptive rather than descriptive.
Constructed languages (also called planned languages or conlangs) are more common in the modern day.
Many have been designed to aid human communication (for example, naturalistic Interlingua, schematic Esperanto, and the highly logic-compatible artificial language Lojban).
Each of these languages has its own grammar.
No clear line can be drawn between syntax and morphology.
Analytic languages use syntax to convey information that is encoded via inflection in synthetic languages.
In other words, word order is not significant and morphology is highly significant in a purely synthetic language, whereas morphology is not significant and syntax is highly significant in an analytic language.
Chinese and Afrikaans, for example, are highly analytic, and meaning is therefore very context-dependent.
(Both do have some inflections, and have had more in the past; thus, they are becoming even less synthetic and more "purely" analytic over time.)
Latin, which is highly synthetic, uses affixes and inflections to convey the same information that Chinese does with syntax.
Because Latin words are quite (though not completely) self-contained, an intelligible Latin sentence can be made from elements that are placed in a largely arbitrary order.
Latin has a complex affixation and a simple syntax, while Chinese has the opposite.
Grammar frameworks
Various "grammar frameworks" have been developed in theoretical linguistics since the mid 20th century, in particular under the influence of the idea of a "Universal grammar" in the USA.
Of these, the main divisions are:
Transformational grammar (TG)
Principles and Parameters Theory (P&P)
Lexical-functional Grammar (LFG)
Generalized Phrase Structure Grammar (GPSG)
Head-Driven Phrase Structure Grammar (HPSG)
Dependency grammars (DG)
Role and reference grammar (RRG)
Hidden Markov model
A hidden Markov model (HMM) is a statistical model in which the system being modeled is assumed to be a Markov process with unknown parameters, and the challenge is to determine the hidden parameters from the observable parameters.
The extracted model parameters can then be used to perform further analysis, for example for pattern recognition applications.
An HMM can be considered as the simplest dynamic Bayesian network.
In a regular Markov model, the state is directly visible to the observer, and therefore the state transition probabilities are the only parameters.
In a hidden Markov model, the state is not directly visible, but variables influenced by the state are visible.
Each state has a probability distribution over the possible output tokens.
Therefore the sequence of tokens generated by an HMM gives some information about the sequence of states.
Hidden Markov models are especially known for their application in temporal pattern recognition such as speech, handwriting, gesture recognition, musical score following, partial discharges and bioinformatics.
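The generative view described above (a hidden state sequence that emits visible tokens) can be illustrated with a minimal sampling sketch. The two weather states, the three activity tokens and all probabilities are hypothetical values chosen for illustration only; they are not taken from the article.

```python
import random

# Hypothetical toy model: names and numbers are illustrative assumptions.
START_P = {"Rainy": 0.6, "Sunny": 0.4}
TRANS_P = {
    "Rainy": {"Rainy": 0.7, "Sunny": 0.3},
    "Sunny": {"Rainy": 0.4, "Sunny": 0.6},
}
EMIT_P = {
    "Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
    "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1},
}


def sample_hmm(length, seed=0):
    """Draw a hidden state sequence and the token sequence it emits."""
    rng = random.Random(seed)

    def draw(dist):
        # Pick one outcome according to the probabilities in `dist`.
        outcomes = list(dist)
        weights = [dist[o] for o in outcomes]
        return rng.choices(outcomes, weights=weights)[0]

    hidden, observed = [], []
    state = draw(START_P)
    for _ in range(length):
        hidden.append(state)
        observed.append(draw(EMIT_P[state]))  # token depends only on the current state
        state = draw(TRANS_P[state])          # next state depends only on the current state
    return hidden, observed


hidden, observed = sample_hmm(5)
print("hidden:  ", hidden)    # not visible to an observer of a real HMM
print("observed:", observed)  # the only data an observer sees
```

An observer would see only the second list; because each state favours different tokens, the tokens carry partial information about the states that produced them.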
Architecture of a hidden Markov model
The diagram below shows the general architecture of an instantiated HMM.
Each oval shape represents a random variable that can adopt a number of values.
The random variable $x(t)$ is the hidden state at time $t$.
The random variable $y(t)$ is the observation at time $t$.
The arrows in the diagram (often called a trellis diagram) denote conditional dependencies.
From the diagram, it is clear that the value of the hidden variable $x(t)$ (at time $t$) only depends on the value of the hidden variable $x(t-1)$: the values at time $t-2$ and before have no influence.
This is called the Markov property.
Similarly, the value of the observed variable $y(t)$ only depends on the value of the hidden variable $x(t)$ (both at time $t$).
Probability of an observed sequence
The probability of observing a sequence $Y = y(0), y(1), \dots, y(L-1)$ of length $L$ is given by
$$P(Y) = \sum_{X} P(Y \mid X)\, P(X),$$
where the sum runs over all possible hidden node sequences $X = x(0), x(1), \dots, x(L-1)$.
Brute force calculation of $P(Y)$ is intractable for most real-life problems, as the number of possible hidden node sequences is typically extremely high.
The calculation can however be sped up enormously using the forward algorithm or the equivalent backward algorithm.
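As a rough illustration of the speed-up, the sketch below computes the probability of an observed sequence twice: once by summing over every hidden path, and once with the forward recursion. The toy model (two states, three tokens, and every probability) is a hypothetical assumption, and the function names are invented for this example.

```python
from itertools import product

# Hypothetical toy model: names and numbers are illustrative assumptions.
STATES = ("Rainy", "Sunny")
START_P = {"Rainy": 0.6, "Sunny": 0.4}
TRANS_P = {
    "Rainy": {"Rainy": 0.7, "Sunny": 0.3},
    "Sunny": {"Rainy": 0.4, "Sunny": 0.6},
}
EMIT_P = {
    "Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
    "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1},
}


def brute_force_probability(observations):
    """Sum P(Y | X) P(X) over every hidden path X: cost grows as |states|**L."""
    total = 0.0
    for path in product(STATES, repeat=len(observations)):
        p = START_P[path[0]] * EMIT_P[path[0]][observations[0]]
        for t in range(1, len(path)):
            p *= TRANS_P[path[t - 1]][path[t]] * EMIT_P[path[t]][observations[t]]
        total += p
    return total


def forward_probability(observations):
    """Forward algorithm: cost grows as L * |states|**2 (non-empty input assumed)."""
    # alpha[s] is the joint probability of the tokens seen so far and ending in state s.
    alpha = {s: START_P[s] * EMIT_P[s][observations[0]] for s in STATES}
    for token in observations[1:]:
        alpha = {
            s: EMIT_P[s][token] * sum(alpha[r] * TRANS_P[r][s] for r in STATES)
            for s in STATES
        }
    return sum(alpha.values())


observed = ("walk", "shop", "clean")
p_brute = brute_force_probability(observed)
p_forward = forward_probability(observed)
print(p_brute, p_forward)  # the two values agree up to rounding (about 0.0336 here)
```

The recursion reuses the partial sums in `alpha` instead of re-enumerating paths, which is where the saving comes from.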
Using hidden Markov models
There are three canonical problems associated with HMM:
Given the parameters of the model, compute the probability of a particular output sequence, and the probabilities of the hidden state values given that output sequence.
This problem is solved by the forward-backward algorithm.
Given the parameters of the model, find the most likely sequence of hidden states that could have generated a given output sequence.
This problem is solved by the Viterbi algorithm.
Given an output sequence or a set of such sequences, find the most likely set of state transition and output probabilities.
In other words, discover the parameters of the HMM given a dataset of sequences.
This problem is solved by the Baum-Welch algorithm.
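The second problem above (finding the most likely hidden state sequence) can be sketched with a compact Viterbi implementation. The toy model and all of its probabilities are hypothetical assumptions for illustration, and the function name is invented for this example; production code would usually work with log-probabilities to avoid underflow on long sequences.

```python
# Hypothetical toy model: names and numbers are illustrative assumptions.
STATES = ("Rainy", "Sunny")
START_P = {"Rainy": 0.6, "Sunny": 0.4}
TRANS_P = {
    "Rainy": {"Rainy": 0.7, "Sunny": 0.3},
    "Sunny": {"Rainy": 0.4, "Sunny": 0.6},
}
EMIT_P = {
    "Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
    "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1},
}


def viterbi(observations):
    """Return (probability, path) of the most likely hidden state sequence."""
    # best[s] holds the probability and path of the best sequence ending in state s.
    best = {s: (START_P[s] * EMIT_P[s][observations[0]], [s]) for s in STATES}
    for token in observations[1:]:
        new_best = {}
        for s in STATES:
            # Choose the predecessor state that maximises the path probability.
            prob, path = max(
                ((best[r][0] * TRANS_P[r][s], best[r][1]) for r in STATES),
                key=lambda item: item[0],
            )
            new_best[s] = (prob * EMIT_P[s][token], path + [s])
        best = new_best
    return max(best.values(), key=lambda item: item[0])


probability, path = viterbi(("walk", "shop", "clean"))
print(path, probability)
```

For this toy model the most likely explanation of "walk, shop, clean" is a sunny day followed by two rainy days.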
A concrete example
This example is further elaborated in the Viterbi algorithm page.
Applications of hidden Markov models
Cryptanalysis
Speech recognition
Machine translation
Partial discharge
History
Hidden Markov Models were first described in a series of statistical papers by Leonard E. Baum and other authors in the second half of the 1960s.
One of the first applications of HMMs was speech recognition, starting in the mid-1970s.
In the second half of the 1980s, HMMs began to be applied to the analysis of biological sequences, in particular DNA.
Since then, they have become ubiquitous in the field of bioinformatics.
HTML
HTML, an initialism of HyperText Markup Language, is the predominant markup language for web pages.
It provides a means to describe the structure of text-based information in a document — by denoting certain text as links, headings, paragraphs, lists, and so on — and to supplement that text with interactive forms, embedded images, and other objects.
HTML is written in the form of tags, surrounded by angle brackets.
HTML can also describe, to some degree, the appearance and semantics of a document, and can include embedded scripting language code (such as JavaScript) which can affect the behavior of Web browsers and other HTML processors.
HTML is also often used to refer to content in specific languages, such as a MIME type text/html, or even more broadly as a generic term for HTML, whether in its XML-descended form (such as XHTML 1.0 and later) or its form descended directly from SGML (such as HTML 4.01 and earlier).
By convention, HTML format data files use a file extension .html or .htm.
History of HTML
Origins
In 1980, physicist Tim Berners-Lee, who was an independent contractor at CERN, proposed and prototyped ENQUIRE, a system for CERN researchers to use and share documents.
In 1989, Berners-Lee and CERN data systems engineer Robert Cailliau each submitted separate proposals for an Internet-based hypertext system providing similar functionality.
The following year, they collaborated on a joint proposal, the WorldWideWeb (W3) project, which was accepted by CERN.
First specifications
The first publicly available description of HTML was a document called HTML Tags, first mentioned on the Internet by Berners-Lee in late 1991.
It describes 22 elements comprising the initial, relatively simple design of HTML.
Thirteen of these elements still exist in HTML 4.
Berners-Lee considered HTML to be, at the time, an application of SGML, but it was not formally defined as such until the mid-1993 publication, by the IETF, of the first proposal for an HTML specification: Berners-Lee and Dan Connolly's "Hypertext Markup Language (HTML)" Internet-Draft, which included an SGML Document Type Definition to define the grammar.
The draft expired after six months, but was notable for its acknowledgment of the NCSA Mosaic browser's custom tag for embedding in-line images, reflecting the IETF's philosophy of basing standards on successful prototypes.
Similarly, Dave Raggett's competing Internet-Draft, "HTML+ (Hypertext Markup Format)", from late 1993, suggested standardizing already-implemented features like tables and fill-out forms.
After the HTML and HTML+ drafts expired in early 1994, the IETF created an HTML Working Group, which in 1995 completed "HTML 2.0", the first HTML specification intended to be treated as a standard against which future implementations should be based.
Published as Request for Comments 1866, HTML 2.0 included ideas from the HTML and HTML+ drafts.
There was no "HTML 1.0"; the 2.0 designation was intended to distinguish the new edition from previous drafts.
Further development under the auspices of the IETF was stalled by competing interests.
Since 1996, the HTML specifications have been maintained, with input from commercial software vendors, by the World Wide Web Consortium (W3C).
However, in 2000, HTML also became an international standard (ISO/IEC 15445:2000).
The last HTML specification published by the W3C is the HTML 4.01 Recommendation, published in late 1999.
Its issues and errors were last acknowledged by errata published in 2001.
Version history of the standard
HTML versions
July 1993: Hypertext Markup Language was published as an IETF working draft (that is, not yet a standard).
November, 1995: HTML 2.0 published as IETF Request for Comments:
RFC 1866,
supplemented by RFC 1867 (form-based file upload) that same month,
RFC 1942 (tables) in May 1996,
RFC 1980 (client-side image maps) in August 1996, and
RFC 2070 (internationalization) in January 1997;
Ultimately, all were declared obsolete/historic by RFC 2854 in June 2000.
April 1995: HTML 3.0, proposed as a standard to the IETF.
It included many of the capabilities that were in Raggett's HTML+ proposal, such as support for tables, text flow around figures, and the display of complex mathematical formulas.
A demonstration appeared in W3C's own Arena browser.
HTML 3.0 did not succeed for several reasons.
The pace of browser development, as well as the number of interested parties, had outstripped the resources of the IETF.
Netscape continued to introduce HTML elements that specified the visual appearance of documents, contrary to the goals of the newly-formed W3C, which sought to limit HTML to describing logical structure.
Microsoft, a newcomer at the time, played to all sides by creating its own tags, implementing Netscape's elements for compatibility, and supporting W3C features such as Cascading Style Sheets.
January 14, 1997: HTML 3.2, published as a W3C Recommendation.
It was the first version developed and standardized exclusively by the W3C, as the IETF had closed its HTML Working Group in September 1996.
The new version dropped math formulas entirely, reconciled overlap among various proprietary extensions, and adopted most of Netscape's visual markup tags.
Netscape's blink element and Microsoft's marquee element were omitted due to a mutual agreement between the two companies.
The ability to include mathematical formulas in HTML would not be standardized until years later in MathML.
December 18, 1997: HTML 4.0, published as a W3C Recommendation.
It offers three "flavors":
Strict, in which deprecated elements are forbidden,
Transitional, in which deprecated elements are allowed,
Frameset, in which mostly only frame related elements are allowed;
HTML 4.0 (initially code-named "Cougar") likewise adopted many browser-specific element types and attributes, but at the same time sought to phase out Netscape's visual markup features by marking them as deprecated in favor of style sheets.
Minor editorial revisions to the HTML 4.0 specification were published in 1998 without incrementing the version number and further minor revisions as HTML 4.01.
April 24, 1998: HTML 4.0 was reissued with minor edits without incrementing the version number.
December 24, 1999: HTML 4.01, published as a W3C Recommendation.
It offers the same three flavors as HTML 4.0, and its last errata were published May 12, 2001.
HTML 4.01 and ISO/IEC 15445:2000 are the most recent and final versions of HTML.
May 15, 2000: ISO/IEC 15445:2000 (https://www.cs.tcd.ie/15445/15445.HTML) ("ISO HTML", based on HTML 4.01 Strict), published as an ISO/IEC international standard.
January 22, 2008: HTML 5, published as a Working Draft by W3C.
XHTML versions
XHTML is a separate language that began as a reformulation of HTML 4.01 using XML 1.0.
It continues to be developed:
XHTML 1.0, published January 26, 2000 as a W3C Recommendation, later revised and republished August 1, 2002.
It offers the same three flavors as HTML 4.0 and 4.01, reformulated in XML, with minor restrictions.
XHTML 1.1, published May 31, 2001 as a W3C Recommendation.
It is based on XHTML 1.0 Strict, but includes minor changes, can be customized, and is reformulated using modules from Modularization of XHTML, which was published April 10, 2001 as a W3C Recommendation.
XHTML 2.0 is still a W3C Working Draft.
XHTML 2.0 is incompatible with XHTML 1.x and, therefore, would be more accurate to characterize as an XHTML-inspired new language than an update to XHTML 1.x.
XHTML 5, which is an update to XHTML 1.x, is being defined alongside HTML 5 in the HTML 5 draft.
HTML markup
HTML markup consists of several key components, including elements (and their attributes), character-based data types, and character references and entity references.
Another important component is the document type declaration.
HTML Hello World:
Elements
See HTML elements for more detailed descriptions.
Elements are the basic structure for HTML markup.
Elements have two basic properties: attributes and content.
Each attribute and each element's content has certain restrictions that must be followed for an HTML document to be considered valid.
An element usually has a start tag (e.g. <element-name>) and an end tag (e.g. </element-name>).
The element's attributes are contained in the start tag and content is located between the tags (e.g. <element-name attribute="value">Content</element-name>).
Some elements, such as <br>, do not have any content and must not have a closing tag.
Listed below are several types of markup elements used in HTML.
Structural markup describes the purpose of text.
For example, <h2>Golf</h2> establishes "Golf" as a second-level heading, which would be rendered in a browser in a manner similar to the "HTML markup" title at the start of this section.
Structural markup does not denote any specific rendering, but most Web browsers have standardized on how elements should be formatted.
Text may be further styled with Cascading Style Sheets (CSS).
Presentational markup describes the appearance of the text, regardless of its function.
For example, <b>boldface</b> indicates that visual output devices should render "boldface" in bold text, but gives no indication of what devices that are unable to do this (such as aural devices that read the text aloud) should do.
In the case of both <b>bold</b> and <i>italic</i>, there are elements which usually have an equivalent visual rendering but are more semantic in nature, namely <strong>strong emphasis</strong> and <em>emphasis</em> respectively.
It is easier to see how an aural user agent should interpret the latter two elements.
However, they are not equivalent to their presentational counterparts: it would be undesirable for a screen-reader to emphasize the name of a book, for instance, but on a screen such a name would be italicized.
Most presentational markup elements have become deprecated under the HTML 4.0 specification, in favor of CSS based style design.
Hypertext markup links parts of the document to other documents.
HTML up through version XHTML 1.1 requires the use of an anchor element to create a hyperlink in the flow of text: <a>Wikipedia</a>.
However, the href attribute must also be set to a valid URL, so for example the HTML code <a href="http://en.wikipedia.org/">Wikipedia</a> will render the word "Wikipedia" as a hyperlink.
To link on an image, the anchor tag uses the following syntax: <a href="url"><img src="image.gif" alt="alternative text"></a>
Attributes
Most of the attributes of an element are name-value pairs, separated by "=", and written within the start tag of an element, after the element's name.
The value may be enclosed in single or double quotes, although values consisting of certain characters can be left unquoted in HTML (but not XHTML).
Leaving attribute values unquoted is considered unsafe.
In contrast with name-value pair attributes, there are some attributes that affect the element simply by their presence in the start tag of the element (like the ismap attribute for the img element).
Most elements can take any of several common attributes:
The id attribute provides a document-wide unique identifier for an element.
This can be used by stylesheets to provide presentational properties, by browsers to focus attention on the specific element, or by scripts to alter the contents or presentation of an element.
The class attribute provides a way of classifying similar elements for presentation purposes.
For example, an HTML document might use the designation class="notation" to indicate that all elements with this class value are subordinate to the main text of the document.
Such elements might be gathered together and presented as footnotes on a page instead of appearing in the place where they occur in the HTML source.
An author may use the style attribute to assign presentational properties to a particular element.
It is considered better practice to use an element's id or class attributes to select the element with a stylesheet, though sometimes this can be too cumbersome for a simple ad hoc application of styled properties.
The title attribute is used to attach subtextual explanation to an element.
In most browsers this attribute is displayed as what is often referred to as a tooltip.
The generic inline element span can be used to demonstrate these various attributes: <span id="anId" class="aClass" style="color:blue;" title="Hypertext Markup Language">HTML</span>
This example displays as HTML; in most browsers, pointing the cursor at the abbreviation should display the title text "Hypertext Markup Language."
Most elements also take the language-related attributes lang and dir.
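To make the distinction between name-value attributes and presence-only attributes concrete, the following sketch feeds a small fragment to the HTMLParser class from Python's standard library and records what each start tag carries. The markup string, the class name AttributeCollector and the file name map.gif are assumptions made up for this illustration.

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Record every start tag with its attributes, plus any text content."""

    def __init__(self):
        super().__init__()
        self.elements = []  # list of (tag name, {attribute name: value})
        self.text = []      # character data found between tags

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None when the
        # attribute is present without a value (for example, ismap).
        self.elements.append((tag, dict(attrs)))

    def handle_data(self, data):
        self.text.append(data)


# Hypothetical markup used only for this example.
markup = (
    '<span id="anId" class="aClass" title="Hypertext Markup Language">HTML</span>'
    '<img src="map.gif" ismap>'
)

collector = AttributeCollector()
collector.feed(markup)
collector.close()

for tag, attributes in collector.elements:
    print(tag, attributes)
print(collector.text)
```

The parser reports the value-less ismap attribute with a value of None, which mirrors the idea that such attributes act by presence alone.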
Character and entity references
As of version 4.0, HTML defines a set of 252 character entity references and a set of 1,114,050 numeric character references, both of which allow individual characters to be written via simple markup, rather than literally.
A literal character and its markup counterpart are considered equivalent and are rendered identically.
The ability to "escape" characters in this way allows for the characters < and & (when written as &lt; and &amp;, respectively) to be interpreted as character data, rather than markup.
For example, a literal < normally indicates the start of a tag, and & normally indicates the start of a character entity reference or numeric character reference; writing it as &amp; or &#x26; or &#38; allows & to be included in the content of elements or the values of attributes.
The double-quote character ("), when used to quote an attribute value, must also be escaped as &quot; or &#x22; or &#34; when it appears within the attribute value itself.
The single-quote character ('), when used to quote an attribute value, must also be escaped as &#x27; or &#39; (should NOT be escaped as &apos; except in XHTML documents) when it appears within the attribute value itself.
However, since document authors often overlook the need to escape these characters, browsers tend to be very forgiving, treating them as markup only when subsequent text appears to confirm that intent.
Escaping also allows for characters that are not easily typed or that aren't even available in the document's character encoding to be represented within the element and attribute content.
For example, the acute-accented e (é), a character typically found only on Western European keyboards, can be written in any HTML document as the entity reference &eacute; or as the numeric references &#xE9; or &#233;.
The characters comprising those references (that is, the &, the ;, the letters in eacute, and so on) are available on all keyboards and are supported in all character encodings, whereas the literal é is not.
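The escaping rules above can be tried out with the html module from Python's standard library, shown here as a small sketch rather than a full treatment of HTML escaping. The sample strings are made up for this illustration.

```python
import html

# Hypothetical sample text containing markup-significant characters.
raw = 'Fish & Chips <b>"today"</b>'

# html.escape replaces &, < and >; with the default quote=True it also
# replaces the double quote with &quot; and the single quote with &#x27;.
escaped = html.escape(raw)
print(escaped)

# html.unescape resolves named and numeric character references alike.
restored = html.unescape(escaped)
accented = html.unescape("caf&eacute; caf&#233; caf&#xE9;")
print(restored)
print(accented)
```

Note that the single quote is emitted as a numeric reference rather than &apos;, consistent with the caution about &apos; in HTML documents.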
Data types
HTML defines several data types for element content, such as script data and stylesheet data, and a plethora of types for attribute values, including IDs, names, URIs, numbers, units of length, languages, media descriptors, colors, character encodings, dates and times, and so on.
All of these data types are specializations of character data.
The Document Type Declaration
In order to enable Document Type Definition (DTD)-based validation with SGML tools and in order to avoid the quirks mode in browsers, HTML documents can start with a Document Type Declaration (informally, a "DOCTYPE").
The DTD to which the DOCTYPE refers contains machine-readable grammar specifying the permitted and prohibited content for a document conforming to such a DTD.
Browsers do not necessarily read the DTD, however.
The most popular graphical browsers use DOCTYPE declarations (or the lack thereof) and other data at the beginning of sources to determine which rendering mode to use.
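For example: <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
This declaration references the Strict DTD of HTML 4.01, which does not have presentational elements like <font>, leaving formatting to Cascading Style Sheets and the span and div tags.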
SGML-based validators read the DTD in order to properly parse the document and to perform validation.
In modern browsers, the HTML 4.01 Strict doctype activates standards layout mode for CSS as opposed to quirks mode.
In addition, HTML 4.01 provides Transitional and Frameset DTDs.
The Transitional DTD was intended to gradually phase in the changes made in the Strict DTD, while the Frameset DTD was intended for those documents which contained frames.
Semantic HTML
There is no official specification called "Semantic HTML", though the strict flavors of HTML discussed below are a push in that direction.
Rather, semantic HTML refers to an objective and a practice to create documents with HTML that contain only the author's intended meaning, without any reference to how this meaning is presented or conveyed.
A classic example is the distinction between the emphasis element (<em>) and the italics element (<i>).
Often the emphasis element is displayed in italics, so the presentation is typically the same.
However, emphasizing something is different from listing the title of a book, for example, which may also be displayed in italics.
In purely semantic HTML, a book title would use a different element than emphasized text uses (for example a <cite>), because they are meaningfully different things.
The goal of semantic HTML requires two things of authors:
To avoid the use of presentational markup (elements, attributes, and other entities).
To use available markup to differentiate the meanings of phrases and structure in the document.
So for example, the book title from above would need to have its own element and class specified, such as <cite class="booktitle">The Grapes of Wrath</cite>
Here, the <cite> element is used because it most closely matches the meaning of this phrase in the text.
However, the <cite> element is not specific enough to this task, since we mean to cite specifically a book title as opposed to a newspaper article or an academic journal.
Semantic HTML also requires complementary specifications and software compliance with these specifications.
Primarily, the development and proliferation of CSS has led to increasing support for semantic HTML, because CSS provides designers with a rich language to alter the presentation of semantic-only documents.
With the development of CSS, the need to include presentational properties in a document has virtually disappeared.
With the advent and refinement of CSS and the increasing support for it in Web browsers, subsequent editions of HTML increasingly stress only using markup that suggests the semantic structure and phrasing of the document, like headings, paragraphs, quotes, and lists, instead of using markup which is written for visual purposes only, like <font>, <b> (bold), and <i> (italics).
Some of these elements are not permitted in certain varieties of HTML, like HTML 4.01 Strict.
CSS provides a way to separate document semantics from the content's presentation, by keeping everything relevant to presentation defined in a CSS file.
See separation of style and content.
Semantic HTML offers many advantages.
First, it ensures consistency in style across elements that have the same meaning.
Every heading, every quotation, every similar element receives the same presentation properties.
Second, semantic HTML frees authors from the need to concern themselves with presentation details.
When writing the number two, for example, should it be written out in words ("two"), or should it be written as a numeral (2)?
A semantic markup might enter something like 2 and leave presentation details to the stylesheet designers.
Similarly, an author might wonder where to break out quotations into separate indented blocks of text: with purely semantic HTML, such details would be left up to stylesheet designers.
Authors would simply indicate quotations when they occur in the text, and not concern themselves with presentation.
A third advantage is device independence and repurposing of documents.
A semantic HTML document can be paired with any number of stylesheets to provide output to computer screens (through Web browsers), high-resolution printers, handheld devices, aural browsers or braille devices for those with visual impairments, and so on.
To accomplish this, nothing needs to be changed in a well-coded semantic HTML document.
Readily available stylesheets make this a simple matter of pairing a semantic HTML document with the appropriate stylesheets.
(Of course, the stylesheet's selectors need to match the appropriate properties in the HTML document.)
Some aspects of authoring documents make separating semantics from style (in other words, meaning from presentation) difficult.
Some elements are hybrids, using presentation in their very meaning.
For example, a table displays content in a tabular form.
Often such content conveys the meaning only when presented in this way.
Repurposing a table for an aural device typically involves somehow presenting the table as an inherently visual element in an audible form.
Conversely, song lyrics, which are inherently meant for audible presentation, are frequently presented in textual form on a Web page.
For these types of elements, the meaning is not so easily separated from their presentation.
However, for a great many of the elements used and meanings conveyed in HTML, the translation is relatively smooth.
Delivery of HTML
HTML documents can be delivered by the same means as any other computer file; however, they are most often delivered in one of two forms: over HTTP servers and through e-mail.
Publishing HTML with HTTP
The World Wide Web is composed primarily of HTML documents transmitted from a Web server to a Web browser using the Hypertext Transfer Protocol (HTTP).
However, HTTP can be used to serve images, sound, and other content in addition to HTML.
To allow the Web browser to know how to handle the document it received, an indication of the file format of the document must be transmitted along with the document.
This vital metadata includes the MIME type (text/html for HTML 4.01 and earlier, application/xhtml+xml for XHTML 1.0 and later) and the character encoding (see Character encodings in HTML).
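As a minimal sketch of how this metadata might be read, the following Python snippet splits a header value into its MIME type and character encoding. The header name Content-Type, the helper name parse_content_type, and the sample values are assumptions made for illustration; the text above does not name the header, and the snippet does not model what any particular browser does.

```python
from email.message import Message


def parse_content_type(header_value):
    """Split a Content-Type header value into (MIME type, charset).

    Hypothetical helper: it reuses the standard library's generic header
    parser and is only meant to illustrate the two pieces of metadata.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns None when no charset parameter is given.
    return msg.get_content_type(), msg.get_content_charset()


if __name__ == "__main__":
    print(parse_content_type("text/html; charset=UTF-8"))
    print(parse_content_type("application/xhtml+xml"))
```

A receiver could then branch on the first value (for example text/html versus application/xhtml+xml) to decide how to interpret the document.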
In modern browsers, the MIME type that is sent with the HTML document affects how the document is interpreted.
A document served with the XHTML MIME type (application/xhtml+xml) is expected to be well-formed XML, and a syntax error causes the browser to fail to render the document.
The same document served with the HTML MIME type (text/html) might be displayed successfully, since Web browsers are more lenient with HTML.
However, XHTML parsed in this way is considered neither proper XHTML nor proper HTML, but so-called tag soup.
If the MIME type is not recognized as HTML, the Web browser should not attempt to render the document as HTML, even if the document is prefaced with a correct Document Type Declaration.
Nevertheless, some Web browsers do examine the contents or URL of the document and attempt to infer the file type, despite this being forbidden by the HTTP 1.1 specification.
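The difference between strict XML handling and lenient HTML handling described above can be sketched in Python. This is an illustration only: the standard library's XML parser and HTML parser stand in for a browser's two parsing modes, and the helper names and sample markup are hypothetical.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Lenient parser that simply records every start tag it sees."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def is_well_formed_xml(markup):
    """Return True only if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


def lenient_tags(markup):
    """Return the start tags a forgiving HTML parser finds in the markup."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


# The paragraph and line break are never closed, so this is not well-formed XML.
broken = "<html><body><p>Unclosed paragraph<br></body></html>"
closed = "<html><body><p>Closed paragraph<br /></p></body></html>"

if __name__ == "__main__":
    print(is_well_formed_xml(broken), lenient_tags(broken))
    print(is_well_formed_xml(closed), lenient_tags(closed))
```

The strict parser rejects the first sample outright, while the lenient parser still recovers its tags, which mirrors the behavior attributed to the two MIME types.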
HTML e-mail
Most graphical e-mail clients allow the use of a subset of HTML (often ill-defined) to provide formatting and semantic markup capabilities not available with plain text, like emphasized text, block quotations for replies, and diagrams or mathematical formulas that could not easily be described otherwise.
Many of these clients include both a GUI editor for composing HTML e-mail messages and a rendering engine for displaying received HTML messages.
Use of HTML in e-mail is controversial because of compatibility issues, because it can be used in phishing/privacy attacks, because it can confuse spam filters, and because the message size is larger than plain text.
Naming conventions
The most common filename extension for files containing HTML is .html.
A common abbreviation of this is .htm; it originates from older operating systems and file systems, such as the DOS versions from the 1980s and early 1990s and FAT, which limit file extensions to three letters.
Both forms are widely supported by browsers.
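As a small illustration that both extensions are conventionally treated alike, Python's standard mimetypes module can be asked which MIME type it associates with each. The file names and the helper name guess_mime are hypothetical, and the mapping shown belongs to that module, not to any browser.

```python
import mimetypes


def guess_mime(filename):
    """Return the MIME type the standard library associates with a file name."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    return mime_type


if __name__ == "__main__":
    for name in ("index.html", "index.htm"):
        print(name, guess_mime(name))
```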
Current flavors of HTML
Since its inception, HTML and its associated protocols gained acceptance relatively quickly.
However, no clear standards existed in the early years of the language.
Though its creators originally conceived of HTML as a semantic language devoid of presentation details, practical uses pushed many presentational elements and attributes into the language, driven largely by the various browser vendors.
The latest standards surrounding HTML reflect efforts to overcome the sometimes chaotic development of the language and to create a rational foundation for building both meaningful and well-presented documents.
To return HTML to its role as a semantic language, the W3C has developed style languages such as CSS and XSL to shoulder the burden of presentation.
In conjunction, the HTML specification has slowly reined in the presentational elements.
There are two axes differentiating various flavors of HTML as currently specified: SGML-based HTML versus XML-based HTML (referred to as XHTML) on the one axis, and strict versus transitional (loose) versus frameset on the other axis.
SGML-based versus XML-based HTML
One difference in the latest HTML specifications lies in the distinction between the SGML-based specification and the XML-based specification.
The XML-based specification is usually called XHTML to distinguish it clearly from the more traditional definition; however, the root element name continues to be 'html' even in the XHTML-specified HTML.
The W3C intended XHTML 1.0 to be identical to HTML 4.01 except where limitations of XML over the more complex SGML require workarounds.
Because XHTML and HTML are closely related, they are sometimes documented in parallel.
In such circumstances, some authors conflate the two names as (X)HTML or X(HTML).
Like HTML 4.01, XHTML 1.0 has three sub-specifications: strict, loose, and frameset.
Aside from the different opening declarations for a document, the differences between an HTML 4.01 and XHTML 1.0 document—in each of the corresponding DTDs—are largely syntactic.
The underlying syntax of HTML allows many shortcuts that XHTML does not, such as elements with optional opening or closing tags, and even EMPTY elements which must not have an end tag.
By contrast, XHTML requires all elements to have both an opening tag and a closing tag.
XHTML, however, also introduces a new shortcut: an XHTML tag may be opened and closed within the same tag, by including a slash before the end of the tag.
The introduction of this shorthand, which is not used in the SGML declaration for HTML 4.01, may confuse earlier software unfamiliar with this new convention.
To understand the subtle differences between HTML and XHTML, consider the transformation of a valid and well-formed XHTML 1.0 document that adheres to Appendix C (see below) into a valid HTML 4.01 document.
To make this translation requires the following steps:
The language for an element should be specified with HTML's own language attribute rather than the XML language attribute used in XHTML.
XHTML uses XML's built-in language-defining attribute.
Remove the XML namespace declaration.
HTML has no facilities for namespaces.
Change the document type declaration from XHTML 1.0 to HTML 4.01 (see DTD section for further explanation).
If present, remove the XML declaration.
Ensure that the document's MIME type is set to text/html.
For both HTML and XHTML, this comes from the HTTP header sent by the server.
Change the XML empty-element syntax, with its slash before the end of the tag, to an HTML-style empty element without the slash.
Those are the main changes necessary to translate a document from XHTML 1.0 to HTML 4.01.
To translate from HTML to XHTML would also require the addition of any omitted opening or closing tags.
Whether coding in HTML or XHTML, it may be simplest to always include the optional tags within a document rather than remembering which tags can be omitted.
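The translation steps listed above can be sketched as a few text substitutions in Python. This is a rough illustration under stated assumptions, not a real converter: it assumes a Strict XHTML 1.0 document that already follows Appendix C (so the HTML language attribute is present alongside the XML one), and the attribute names xmlns and xml:lang, the doctype string, and the function name are filled in here from the XHTML 1.0 and HTML 4.01 specifications, not from the text above. Setting the MIME type is a server matter and is not covered.

```python
import re

# Assumed target: the HTML 4.01 Strict document type declaration.
HTML401_STRICT = (
    '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
    '"http://www.w3.org/TR/html4/strict.dtd">'
)


def xhtml_to_html_sketch(source):
    """Apply the listed translation steps with simple regular expressions."""
    # Remove the XML declaration if present.
    source = re.sub(r'^\s*<\?xml[^>]*\?>\s*', '', source)
    # Change the document type declaration to HTML 4.01.
    source = re.sub(r'<!DOCTYPE[^>]*>', HTML401_STRICT, source, count=1)
    # Remove the XML namespace attribute; HTML has no namespaces.
    source = re.sub(r'\s+xmlns="[^"]*"', '', source)
    # Drop the XML language attribute, keeping the HTML one already present.
    source = re.sub(r'\s+xml:lang="[^"]*"', '', source)
    # Turn the XML empty-element syntax into an HTML-style empty element.
    source = re.sub(r'\s*/>', '>', source)
    return source


sample = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n'
    '<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">'
    '<head><title>t</title></head><body><p>a<br />b</p></body></html>'
)

if __name__ == "__main__":
    print(xhtml_to_html_sketch(sample))
```

Regular expressions are used only to keep the sketch short; they would also rewrite matching text inside comments or character data, so a production tool would work on a parsed document instead.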
A well-formed XHTML document adheres to all the syntax requirements of XML.
A valid document adheres to the content specification for XHTML, which describes the document structure.
The W3C recommends several conventions to ensure an easy migration between HTML and XHTML (see HTML Compatibility Guidelines).
The following steps can be applied to XHTML 1.0 documents only:
Include both the HTML and the XML language attributes on any elements assigning language.
Use the empty-element syntax only for elements specified as empty in HTML.
Include an extra space before the slash in empty-element tags.
Include explicit close tags for elements that permit content but are left empty, rather than using the empty-element syntax for them.
Omit the XML declaration.
By carefully following the W3C’s compatibility guidelines, a user agent should be able to interpret the document equally as HTML or XHTML.
For documents that are XHTML 1.0 and have been made compatible in this way, the W3C permits them to be served either as HTML (with a text/html MIME type), or as XHTML (with an application/xhtml+xml or other XML MIME type).
When delivered as XHTML, browsers should use an XML parser, which adheres strictly to the XML specifications for parsing the document's contents.
Transitional versus Strict
The latest SGML-based specification HTML 4.01 and the earliest XHTML version include three sub-specifications: Strict, Transitional (once called Loose), and Frameset.
The Strict variant represents the standard proper, whereas the Transitional and Frameset variants were developed to assist in the transition from earlier versions of HTML (including HTML 3.2).
The Transitional and Frameset variants allow for presentational markup whereas the Strict variant encourages the use of style sheets through its omission of most presentational markup.
The primary differences which make the Transitional variant more permissive than the Strict variant (the differences are the same in HTML 4 and XHTML 1.0) are:
A looser content model
Inline elements and plain text (#PCDATA) are allowed directly in several container elements.
Presentation related elements
underline
strike-through
Presentation related attributes
Presentational attributes on a range of elements, including paragraph (p) and heading elements
Additional elements in Transitional specification
Two list elements (no substitute, though unordered list is recommended; one may return in the XHTML 2.0 specification)
An element that requires server-side support and is typically added to documents server-side
An element deprecated in favor of the object element
An attribute on the script element (presumably redundant with another attribute, though this is maintained for legacy reasons)
Frame related entities
The element used in place of body for the frameset DTD
One further element
An attribute on several elements, including the client-side image-map element
Frameset versus transitional
In addition to the above transitional differences, the frameset specifications (whether XHTML 1.0 or HTML 4.01) specify a different content model.
Summary of flavors
As this list demonstrates, the loose flavors of the specification are maintained for legacy support.
However, contrary to popular misconceptions, the move to XHTML does not imply a removal of this legacy support.
Rather, the X in XML stands for extensible, and the W3C is modularizing the entire specification and opening it up to independent extensions.
The primary achievement in the move from XHTML 1.0 to XHTML 1.1 is the modularization of the entire specification.
The strict version of HTML is deployed in XHTML 1.1 through a set of modular extensions to the base XHTML 1.1 specification.
Likewise someone looking for the loose (transitional) or frameset specifications will find similar extended XHTML 1.1 support (much of it is contained in the legacy or frame modules).
The modularization also allows for separate features to develop on their own timetable.
So, for example, XHTML 1.1 will allow quicker migration to emerging XML standards such as MathML (a presentational and semantic math language based on XML) and XForms, a new highly advanced web-form technology to replace the existing HTML forms.
In summary, the HTML 4.01 specification primarily reined in all the various HTML implementations into a single clear written specification based on SGML.
XHTML 1.0 ported this specification, as is, to the new XML-defined specification.
Next, XHTML 1.1 takes advantage of the extensible nature of XML and modularizes the whole specification.
XHTML 2.0 will be the first step in adding new features to the specification in a standards-body-based approach.
Hypertext features not in HTML
HTML lacks some of the features found in earlier hypertext systems, such as typed links, transclusion, source tracking, fat links, and more.
Even some hypertext features that were in early versions of HTML have been ignored by most popular web browsers until recently, such as the link element and in-browser Web page editing.
Sometimes Web services or browser manufacturers remedy these shortcomings.
For instance, wikis and content management systems allow surfers to edit the Web pages they visit.
IBM
International Business Machines Corporation, abbreviated IBM and nicknamed "Big Blue", is a multinational computer technology and consulting corporation headquartered in Armonk, New York, USA.
The company is one of the few information technology companies with a continuous history dating back to the 19th century.
IBM manufactures and sells computer hardware and software, and offers infrastructure services, hosting services, and consulting services in areas ranging from mainframe computers to nanotechnology.
IBM has been known through most of its recent history as the world's largest computer company; with over 388,000 employees worldwide, IBM is the largest information technology employer in the world.
Despite falling behind Hewlett-Packard in total revenue since 2006, it remains the most profitable.
IBM holds more patents than any other U.S. based technology company.
It has engineers and consultants in over 170 countries and IBM Research has eight laboratories worldwide.
IBM employees have earned three Nobel Prizes, four Turing Awards, five National Medals of Technology, and five National Medals of Science.
As a chip maker, IBM has been among the Worldwide Top 20 Semiconductor Sales Leaders in past years, and in 2007 IBM ranked second in the list of largest software companies in the world.
History
The company which became IBM was founded in 1896 as the Tabulating Machine Company by Herman Hollerith, in Broome County, New York (Endicott, New York, where it still maintains very limited operations).
It was incorporated as Computing Tabulating Recording Corporation (CTR) on June 16, 1911, and was listed on the New York Stock Exchange in 1916.
IBM adopted its current name in 1924.
In the 1950s, IBM became the dominant vendor in the emerging computer industry with the release of the IBM 701 and other models in the IBM 700/7000 series of mainframes.
The company's dominance became even more pronounced in the 1960s and 1970s with the IBM System/360 and IBM System/370 mainframes.
However, antitrust actions by the United States Department of Justice, the rise of minicomputer companies like Digital Equipment Corporation and Data General, and the introduction of the microprocessor all contributed to the dilution of IBM's position in the industry, eventually leading the company to diversify into other areas including personal computers, software, and services.
In 1981, IBM introduced the IBM Personal Computer, which is the original version and progenitor of the IBM PC compatible hardware platform.
Descendants of the IBM PC compatibles make up the majority of microcomputers on the market today.
IBM sold its PC division to the Chinese company Lenovo on May 1, 2005 for $655 million in cash and $600 million in Lenovo stock.
On January 25, 2007, Ricoh announced the purchase of the IBM Printing Systems Division for $725 million and an investment in a 3-year joint venture to form a new Ricoh subsidiary, InfoPrint Solutions Company; Ricoh will own a 51% share, and IBM will own a 49% share in InfoPrint.
Controversies
The author Edwin Black has alleged that, during World War II, IBM CEO Thomas J. Watson used overseas subsidiaries to provide the Third Reich with unit record data processing machines, supplies and services that helped the Nazis to efficiently track down European Jews, with sizable profits for the company.
IBM denies that it had control over these subsidiaries after the Nazis took power.
A lawsuit against IBM based on these allegations was dismissed.
In support of the Allied war effort in World War II, from 1943 to 1945 IBM produced approximately 346,500 M1 Carbine (Caliber .30 carbine) light rifles for the U.S. Military.
Current projects
Eclipse
Eclipse is a platform-independent, Java-based software framework.
Eclipse was originally a proprietary product developed by IBM as a successor of the VisualAge family of tools.
Eclipse has subsequently been released as free/open source software under the Eclipse Public License.
developerWorks
developerWorks is a website run by IBM for software developers and IT professionals.
It contains a large number of how-to articles and tutorials, as well as software downloads and code samples, discussion forums, podcasts, blogs, wikis, and other resources for developers and technical professionals.
Subjects range from open, industry-standard technologies like Java, Linux, SOA and web services, web development, Ajax, PHP, and XML to IBM's products (WebSphere, Rational, Lotus, Tivoli and DB2).
In 2007 developerWorks was inducted into the Jolt Hall of Fame.
alphaWorks
alphaWorks is IBM's source for emerging software technologies.
These technologies include:
Flexible Internet Evaluation Report Architecture - A highly flexible architecture for the design, display, and reporting of Internet surveys.
IBM History Flow Visualization Application - A tool for visualizing dynamic, evolving documents and the interactions of multiple collaborating authors.
IBM Linux on POWER Performance Simulator - A tool that provides users of Linux on Power a set of performance models for IBM's POWER processors.
Database File Archive And Restoration Management - An application for archiving and restoring hard disk files using file references stored in a database.
Policy Management for Autonomic Computing - A policy-based autonomic management infrastructure that simplifies the automation of IT and business processes.
FairUCE - A spam filter that verifies sender identity instead of filtering content.
Unstructured Information Management Architecture (UIMA) SDK - A Java SDK that supports the implementation, composition, and deployment of applications working with unstructured information.
Accessibility Browser - A web-browser specifically designed to assist people with visual impairments, to be released as open-source software.
Also known as the "A-Browser," the technology will aim to eliminate the need for a mouse, relying instead completely on voice-controls, buttons and predefined shortcut keys.
Semiconductor design and manufacturing
Virtually all modern console gaming systems use microprocessors developed by IBM.
The Xbox 360 contains the Xenon tri-core processor, which was designed and produced by IBM in less than 24 months.
Sony's PlayStation 3 features the Cell BE microprocessor designed jointly by IBM, Toshiba, and Sony.
Nintendo's seventh-generation console, Wii, features an IBM chip codenamed Broadway.
The older Nintendo GameCube also utilizes the Gekko processor, designed by IBM.
In May 2002, IBM and Butterfly.net, Inc. announced the Butterfly Grid, a commercial grid for the online video gaming market.
In March 2006, IBM announced separate agreements with Hoplon Infotainment, Online Game Services Incorporated (OGSI), and RenderRocket to provide on-demand content management and blade server computing resources.
Open Client Offering
IBM announced that it will launch new software called "Open Client Offering", which is to run on Microsoft Windows, Linux, and Apple's Macintosh.
The company states that the product allows businesses to offer employees a choice of using the same software on Windows and its alternatives.
This means that "Open Client Offering" is intended to cut the cost of managing Linux or Apple desktops relative to Windows.
Companies would not need to pay Microsoft for licenses, since their operations would no longer rely on Windows-based software.
One alternative to Microsoft's office software is software based on the Open Document Format, whose development IBM supports.
It is to be used for tasks such as word processing and presentations, alongside collaboration with Lotus Notes, instant messaging and blog tools, and the Firefox web browser, a competitor to Internet Explorer.
IBM plans to install Open Client on 5 percent of its desktop PCs.
UC2: Unified Communications and Collaboration
UC2 (Unified Communications and Collaboration) is an IBM and Cisco joint project based on Eclipse and OSGi.
It will offer the numerous Eclipse application developers a unified platform for an easier work environment.
Software based on the UC2 platform will provide major enterprises with easy-to-use communication solutions, such as the Lotus-based Sametime.
In the future, Sametime users will benefit from additional functions such as click-to-call and voice mailing.
Internal programs
Extreme Blue is a company initiative that uses experienced IBM engineers, talented interns, and business managers to develop high-value technology.
The project is designed to analyze emerging business needs and the technologies that can solve them.
These projects mostly involve rapid-prototyping of high-profile software and hardware projects.
In May 2007, IBM unveiled Project Big Green, a redirection of $1 billion per year across its businesses to increase energy efficiency.
IBM Software Group
This group is one of the major divisions of IBM.
The various brands include:
Information Management Software — database servers and tools, text analytics, content management, business process management and business intelligence.
Lotus Software — Groupware, collaboration and business software.
Acquired in 1995.
Rational Software — Software development and application lifecycle management.
Acquired in 2002.
Tivoli Software — Systems management.
Acquired in 1996.
WebSphere — Integration and application infrastructure software.
Environmental record
IBM has a long history of dealing with its environmental problems.
It established a corporate policy on environmental protection in 1971, with the support of a comprehensive global environmental management system.
According to IBM's figures, its total hazardous waste decreased by 44 percent over the past five years, and has decreased by 94.6 percent since 1987.
IBM's total hazardous waste calculation consists of waste from both non-manufacturing and manufacturing operations.
Waste from manufacturing operations includes waste recycled in closed-loop systems, where process chemicals are recovered for subsequent reuse rather than being disposed of and replaced with new chemical materials.
Over the years, IBM has redesigned processes to eliminate almost all closed-loop recycling and now uses more environmentally friendly materials in their place.
IBM was recognized as one of the "Top 20 Best Workplaces for Commuters" by the U.S. Environmental Protection Agency (EPA) in 2005.
This was to recognize the Fortune 500 companies that provided their employees with excellent commuter benefits that helped reduce traffic and air pollution.
However, the birthplace of IBM, Endicott, suffered IBM's pollution for decades.
IBM used liquid cleaning agents in its circuit board assembly operation for more than two decades, and six spill and leak incidents were recorded, including one 1979 leak of 4,100 gallons from an underground tank.
These left behind volatile organic compounds in the town's soil and aquifer.
Trace elements of volatile organic compounds have been identified in Endicott's drinking water, but the levels are within regulatory limits.
Also, since 1980, IBM has pumped out 78,000 gallons of chemicals, including trichloroethane, Freon, benzene and perchloroethene, into the air, allegedly causing several cancer cases among the villagers.
IBM Endicott has been identified by the Department of Environmental Conservation as the major source of pollution, though traces of contaminants from a local dry cleaner and other polluters were also found.
Despite the amount of pollutants, state health officials cannot say whether air or water pollution in Endicott has actually caused any health problems.
Village officials say tests show that the water is safe to drink.
Solar power
Tokyo Ohka Kogyo Co., Ltd. (TOK) and IBM are collaborating to establish new, low-cost methods for bringing the next generation of solar energy products to market, namely CIGS (Copper-Indium-Gallium-Selenide) solar cell modules.
Use of thin film technology, such as CIGS, has great promise in reducing the overall cost of solar cells and further enabling their widespread adoption.
IBM is exploring four main areas of photovoltaic research: using current technologies to develop cheaper and more efficient silicon solar cells, developing new solution processed thin film photovoltaic devices, concentrator photovoltaics, and future generation photovoltaic architectures based upon nanostructures such as semiconductor quantum dots and nanowires.
Dr. Supratik Guha is the leading scientist in IBM photovoltaics.
Corporate culture of IBM
Big Blue is a nickname for IBM; several theories exist regarding its origin.
One theory, substantiated by people who worked for IBM at the time, is that IBM field reps coined the term in the 1960s, referring to the color of the mainframes IBM installed in the 1960s and early 1970s.
"All blue" was a term used to describe a loyal IBM customer, and business writers later picked up the term.
Another theory suggests that Big Blue simply refers to the Company's logo.
A third theory suggests that Big Blue refers to a former company dress code that required many IBM employees to wear only white shirts, with which many wore blue suits.
In any event, IBM keyboards, typewriters, and some other manufactured devices, have played on the "Big Blue" concept, using the color for enter keys and carriage returns.
Sales
IBM has often been described as having a sales-centric or a sales-oriented business culture.
Traditionally, many IBM executives and general managers are chosen from the sales force.
The current CEO, Sam Palmisano, for example, joined the company as a salesman and, unusually for CEOs of major corporations, has no MBA or postgraduate qualification.
Middle and top management are often enlisted to give direct support to salesmen when pitching sales to important customers.
The uniform
A dark (or gray) suit, a white shirt, and a "sincere" tie made up the public uniform for IBM employees for most of the 20th century.
During IBM's management transformation in the 1990s, CEO Lou Gerstner relaxed these codes, normalizing the dress and behavior of IBM employees to resemble their counterparts in other large technology companies.
IBM company values and "Jam"
In 2003, IBM embarked on an ambitious project to rewrite company values.
Using its Jam technology, the company hosted Intranet-based online discussions on key business issues with 50,000 employees over 3 days.
The discussions were analyzed by sophisticated text analysis software (eClassifier) to mine online comments for themes.
As a result of the 2003 Jam, the company values were updated to reflect three modern business, marketplace and employee views: "Dedication to every client's success", "Innovation that matters - for our company and for the world", "Trust and personal responsibility in all relationships".
In 2004, another Jam was conducted during which 52,000 employees exchanged best practices for 72 hours.
They focused on finding actionable ideas to support implementation of the values previously identified.
A new post-Jam Ratings event was developed to allow IBMers to select key ideas that support the values.
The board of directors cited this Jam when awarding Palmisano a pay rise in the spring of 2005.
In July and September 2006, Palmisano launched another jam called InnovationJam (https://www.globalinnovationjam.com/).
InnovationJam was the largest online brainstorming session ever with more than 150,000 participants from 104 countries.
The participants were IBM employees, members of IBM employees' families, universities, partners, and customers.
InnovationJam was divided into two sessions (one in July and one in September) of 72 hours each and generated more than 46,000 ideas.
In November 2006, IBM declared that it would invest US$100 million in the 10 best ideas from InnovationJam.
Open source
IBM has been influenced by the Open Source Initiative, and began supporting Linux in 1998.
The company invests billions of dollars in services and software based on Linux through the IBM Linux Technology Center, which includes over 300 Linux kernel developers.
IBM has also released code under different open-source licenses, such as the platform-independent software framework Eclipse (worth approximately US$40 million at the time of the donation) and the Java-based relational database management system (RDBMS) Apache Derby.
IBM's open source involvement has not been trouble-free, however (see SCO v. IBM).
Corporate affairs
Diversity and workforce issues
IBM's efforts to promote workforce diversity and equal opportunity date back at least to World War I, when the company hired disabled veterans.
IBM was the only technology company ranked in Working Mother magazine's Top 10 for 2004, and one of two technology companies in 2005 (the other company being Hewlett-Packard).
On September 21, 1953, Thomas J. Watson, the CEO at the time, sent out a very controversial letter to all IBM employees stating that IBM needed to hire the best people, regardless of their race, ethnic origin, or gender.
He stated that this would give IBM a competitive advantage because IBM would then be able to hire talented people its competitors would turn down.
In 1984, IBM added sexual preference to this policy.
The company has traditionally resisted labor union organizing, although unions represent some IBM workers outside the United States.
In the 1990s, two major pension program changes, including a conversion to a cash balance plan, resulted in an employee class action lawsuit alleging age discrimination.
IBM employees won the lawsuit and arrived at a partial settlement, although appeals are still underway.
IBM also settled a major overtime class-action lawsuit in 2006.
Historically IBM has had a good reputation of long-term staff retention with few large scale layoffs.
In more recent years there have been a number of broad sweeping cuts to the workforce as IBM attempts to adapt to changing market conditions and a declining profit base.
After posting weaker than expected revenues in the first quarter of 2005, IBM eliminated 14,500 positions from its workforce, predominantly in Europe.
In May 2005, IBM Ireland told staff that the MD (Micro-electronics Division) facility would close by the end of 2005 and offered staff a settlement.
However, all staff who wished to stay with the company were redeployed within IBM Ireland.
Production moved to Amkor in Singapore, a company that had purchased IBM's Microelectronics business in Singapore; it is widely agreed that IBM promised Amkor full load capacity in return for the purchase of the facility.
On June 8, 2005, IBM Canada Ltd. eliminated approximately 700 positions.
IBM presents these cuts as part of a strategy to "rebalance" its portfolio of professional skills and businesses.
IBM India and other IBM offices in China, the Philippines and Costa Rica have been witnessing a recruitment boom and steady growth in number of employees due to lower wages.
On October 10, 2005, IBM became the first major company in the world to formally commit to not using genetic information in its employment decisions.
This came just a few months after IBM announced its support of the National Geographic Society's Genographic Project.
Gay rights
IBM provides employees' same-sex partners with benefits and provides an anti-discrimination clause.
The Human Rights Campaign has consistently rated IBM 100% on its index of gay-friendliness since 2003 (in 2002, the year it began compiling its report on major companies, IBM scored 86%).
Logos
Logos designed in the 1970s tended to be sensitive to the technical limitations of photocopiers, which were then being widely deployed.
A logo with large solid areas tended to be poorly copied by copiers in the 1970s, so companies preferred logos that avoided large solid areas.
The 1972 IBM logos are an example of this tendency.
With the advent of digital copiers in the mid-1980s this technical restriction had largely disappeared; at roughly the same time, the 13-bar logo was abandoned for almost the opposite reason: it was difficult to render accurately on the low-resolution digital printers (240 dots per inch) of the time.
Board of directors
Current members of the board of directors of IBM are:
Cathleen Black, President, Hearst Magazines
William Brody, President, Johns Hopkins University
Ken Chenault, Chairman and CEO, American Express Company
Juergen Dormann, Chairman of the Board, ABB Ltd
Michael Eskew, Chairman and CEO, United Parcel Service, Inc.
Shirley Ann Jackson, President, Rensselaer Polytechnic Institute
Minoru Makihara, Senior Corporate Advisor and former Chairman, Mitsubishi Corporation
Lucio Noto, Managing Partner, Midstream Partners LLC
James W. Owens, Chairman and CEO, Caterpillar Inc.
Samuel J. Palmisano, Chairman, President and CEO, IBM
Joan Spero, President, Doris Duke Charitable Foundation
Sidney Taurell, Chairman and CEO, Eli Lilly and Company
Lorenzo Zambrano, Chairman and CEO, Cemex SAB de CV
Information
Information as a concept has a diversity of meanings, from everyday usage to technical settings.
Generally speaking, the concept of information is closely related to notions of constraint, communication, control, data, form, instruction, knowledge, meaning, mental stimulus, pattern, perception, and representation.
Many people speak about the Information Age as the advent of the Knowledge Age or knowledge society, the information society, the Information revolution, and information technologies, and even though informatics, information science and computer science are often in the spotlight, the word "information" is often used without careful consideration of the various meanings it has acquired.
Etymology
According to the Oxford English Dictionary, the earliest historical meaning of the word information in English was the act of informing, or giving form or shape to the mind, as in education, instruction, or training.
A quote from 1387: "Five books come down from heaven for information of mankind."
It was also used for an item of training, e.g. a particular instruction.
"Melibee had heard the great skills and reasons of Dame Prudence, and her wise information and techniques."
(1386)
The English word was apparently derived by adding the common "noun of action" ending "-ation" (descended through French from Latin "-tio") to the earlier verb to inform, in the sense of to give form to the mind, to discipline, instruct, teach: "Men so wise should go and inform their kings."
(1330)
Inform itself comes (via French) from the Latin verb informare, to give form to, to form an idea of.
Furthermore, Latin itself already contained the word informatio, meaning concept or idea, but the extent to which this may have influenced the development of the word information in English is unclear.
As a final note, the ancient Greek word for form was eidos, and this word was famously used in a technical philosophical sense by Plato (and later Aristotle) to denote the ideal identity or essence of something (see Theory of forms).
"Eidos" can also be associated with thought, proposition or even concept.
Information as a message
Information is the state of a system of interest.
Message is the information materialized.
Information is a quality of a message from a sender to one or more receivers.
Information is always about something (size of a parameter, occurrence of an event, etc).
Viewed in this manner, information does not have to be accurate.
It may be a truth or a lie, or just the sound of a falling tree.
Even a disruptive noise used to inhibit the flow of communication and create misunderstanding would in this view be a form of information.
However, generally speaking, if the amount of information in the received message increases, the message is more accurate.
This model assumes there is a definite sender and at least one receiver.
Many refinements of the model assume the existence of a common language understood by the sender and at least one of the receivers.
An important variation identifies information as that which would be communicated by a message if it were sent from a sender to a receiver capable of understanding the message.
Notably, it is not required that the sender be capable of understanding the message, or even cognizant that there is a message.
Thus, information is something that can be extracted from an environment, e.g., through observation, reading or measurement.
Information is a term with many meanings depending on context, but is as a rule closely related to such concepts as meaning, knowledge, instruction, communication, representation, and mental stimulus.
Simply stated, information is a message received and understood.
In terms of data, it can be defined as a collection of facts from which conclusions may be drawn.
There are many other aspects of information since it is the knowledge acquired through study or experience or instruction.
But overall, information is the result of processing, manipulating and organizing data in a way that adds to the knowledge of the person receiving it.
Communication theory provides a numerical measure of the uncertainty of an outcome.
For example, we can say that "the signal contained thousands of bits of information".
Communication theory tends to use the concept of information entropy, generally attributed to C.E. Shannon (see below).
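As a rough illustration of how such a numerical measure of uncertainty can be computed, the following minimal sketch evaluates the Shannon entropy of a discrete distribution in bits. The function name entropy_bits and the example distributions are illustrative assumptions, not part of any cited source.

```python
import math

def entropy_bits(probabilities):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Outcomes with zero probability contribute nothing to the sum.
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

# Hypothetical example distributions (assumptions for illustration only).
fair_coin = entropy_bits([0.5, 0.5])      # maximally uncertain two-way choice
biased_coin = entropy_bits([0.9, 0.1])    # more predictable, so lower entropy
eight_way = entropy_bits([1 / 8] * 8)     # uniform choice among eight messages
```

Under these assumptions the fair coin carries one bit per outcome, the biased coin carries less, and a uniform choice among eight messages carries three bits, which matches the base two logarithm of the number of equally likely messages.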
Another form of information is Fisher information, a concept of R.A. Fisher.
This is used in application of statistics to estimation theory and to science in general.
Fisher information is thought of as the amount of information that a message carries about an unobservable parameter.
It can be computed from knowledge of the likelihood function defining the system.
For example, with a normal likelihood function, the Fisher information is the reciprocal of the variance of the law.
In the absence of knowledge of the likelihood law, the Fisher information may be computed from normally distributed score data as the reciprocal of their second moment.
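The normal-likelihood case can be checked numerically. The sketch below is an illustration under stated assumptions: it treats the mean as the unobservable parameter, defines the Fisher information as the expected squared derivative of the log-likelihood (the score), and approximates that expectation with a simple midpoint sum. All function names, the integration width and the step count are hypothetical choices.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal law with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def score_mu(x, mu, sigma):
    """Derivative of log normal_pdf with respect to the mean mu."""
    return (x - mu) / sigma ** 2

def fisher_information_mu(mu, sigma, steps=20000, width=10.0):
    """Approximate E[score^2] with a midpoint sum over mu +/- width * sigma."""
    lo = mu - width * sigma
    dx = 2 * width * sigma / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        total += score_mu(x, mu, sigma) ** 2 * normal_pdf(x, mu, sigma) * dx
    return total

# With standard deviation 2 the variance is 4, so the result should be close to 1/4.
approx = fisher_information_mu(mu=0.0, sigma=2.0)
```

The numerical value agrees closely with the reciprocal of the variance, as the normal-likelihood example states.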
Even though information and data are often used interchangeably, they are actually very different.
Data is a set of unrelated information, and as such is of no use until it is properly evaluated.
Upon evaluation, once there is some significant relation between data, and they show some relevance, then they are converted into information.
Now this same data can be used for different purposes.
Thus, until the data convey some information, they are not useful.
Measuring information entropy
The view of information as a message came into prominence with the publication in 1948 of an influential paper by Claude Shannon, "A Mathematical Theory of Communication."
This paper provides the foundations of information theory and endows the word information not only with a technical meaning but also a measure.
If the sending device is equally likely to send any one of a set of N messages, then the preferred measure of "the information produced when one message is chosen from the set" is the base two logarithm of N (this measure is called self-information).
In this paper, Shannon continues:
A complementary way of measuring information is provided by algorithmic information theory.
In brief, this measures the information content of a list of symbols based on how predictable they are, or more specifically how easy it is to compute the list through a program: the information content of a sequence is the number of bits of the shortest program that computes it.
The sequence below would have a very low algorithmic information measurement since it is a very predictable pattern, and as the pattern continues the measurement would not change.
Shannon information would give the same information measurement for each symbol, since they are statistically random, and each new symbol would increase the measurement.
123456789101112131415161718192021
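The contrast between the two measures can be sketched in code. Algorithmic information content is not computable in general, so the example below only illustrates the idea: it uses the length of one short generating expression as a stand-in for "the shortest program", and a per-symbol frequency estimate as a stand-in for the Shannon measure. Both stand-ins, and all function names, are assumptions made for illustration.

```python
import math
from collections import Counter

def counting_sequence(n):
    """Concatenate the decimal numerals 1..n, as in the sequence above."""
    return "".join(str(i) for i in range(1, n + 1))

def generator_expression(n):
    """Source text of a short Python expression that evaluates to the sequence."""
    return f'"".join(str(i) for i in range(1, {n} + 1))'

def empirical_shannon_bits(text):
    """Total bits if each symbol is coded according to its observed frequency."""
    counts = Counter(text)
    total = len(text)
    per_symbol = -sum(c / total * math.log2(c / total) for c in counts.values())
    return per_symbol * total

short = counting_sequence(21)
long_ = counting_sequence(1000)

# The generating expression grows only by the digits needed to write n,
# while the symbol-by-symbol estimate grows with the length of the output.
program_growth = len(generator_expression(1000)) - len(generator_expression(21))
shannon_growth = empirical_shannon_bits(long_) - empirical_shannon_bits(short)
```

In this sketch the description of the sequence lengthens by only a couple of characters when the sequence is extended from 21 to 1000, whereas the frequency-based estimate grows by thousands of bits.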
It is important to recognize the limitations of traditional information theory and algorithmic information theory from the perspective of human meaning.
For example, when referring to the meaning content of a message Shannon noted “Frequently the messages have meaning… these semantic aspects of communication are irrelevant to the engineering problem.
The significant aspect is that the actual message is one selected from a set of possible messages” (emphasis in original).
In information theory signals are part of a process, not a substance; they do something, they do not contain any specific meaning.
Combining algorithmic information theory and information theory we can conclude that the most random signal contains the most information as it can be interpreted in any way and cannot be compressed.
Michael Reddy noted that "'signals' of the mathematical theory are 'patterns that can be exchanged'.
There is no message contained in the signal, the signals convey the ability to select from a set of possible messages."
In information theory "the system must be designed to operate for each possible selection, not just the one which will actually be chosen since this is unknown at the time of design".
Information as a pattern
Information is any represented pattern.
This view assumes neither accuracy nor directly communicating parties, but instead assumes a separation between an object and its representation.
Consider the following example: economic statistics represent an economy, however inaccurately.
What are commonly referred to as data in computing, statistics, and other fields, are forms of information in this sense.
The electro-magnetic patterns in a computer network and connected devices are related to something other than the pattern itself, such as text characters to be displayed and keyboard input.
Signals, signs, and symbols are also in this category.
On the other hand, according to semiotics, data is symbols with a certain syntax and information is data with a certain semantics.
Painting and drawing contain information to the extent that they represent something such as an assortment of objects on a table, a profile, or a landscape.
In other words, when a pattern of something is transposed to a pattern of something else, the latter is information.
This would be the case whether or not there was anyone to perceive it.
But if information can be defined merely as a pattern, does that mean that neither utility nor meaning are necessary components of information?
Arguably a distinction must be made between raw unprocessed data and information which possesses utility, value or some quantum of meaning.
On this view, information may indeed be characterized as a pattern; but this is a necessary condition, not a sufficient one.
An individual entry in a telephone book, which follows a specific pattern formed by name, address and telephone number, does not become "informative" in some sense unless and until it possesses some degree of utility, value or meaning.
For example, someone might look up a girlfriend's number or order a takeaway.
The vast majority of numbers will never be construed as "information" in any meaningful sense.
The gap between data and information is only closed by a behavioral bridge whereby some value, utility or meaning is added to transform mere data or pattern into information.
When one constructs a representation of an object, one can selectively extract from the object (sampling) or use a system of signs to replace (encoding), or both.
The sampling and encoding result in representation.
An example of the former is a "sample" of a product; an example of the latter is "verbal description" of a product.
Both contain information of the product, however inaccurate.
When one interprets representation, one can predict a broader pattern from a limited number of observations (inference) or understand the relation between patterns of two different things (decoding).
One example of the former is to sip a soup to know if it is spoiled; an example of the latter is examining footprints to determine the animal and its condition.
In both cases, information sources are not constructed or presented by some "sender" of information.
Regardless, information is dependent upon, but usually unrelated to and separate from, the medium or media used to express it.
In other words, the position of a theoretical series of bits, or even the output once interpreted by a computer or similar device, is unimportant, except when someone or something is present to interpret the information.
Therefore, a quantity of information is totally distinct from its medium.
Information as sensory input
Often information is viewed as a type of input to an organism or designed device.
Inputs are of two kinds.
Some inputs are important to the function of the organism (for example, food) or device (energy) by themselves.
In his book Sensory Ecology, Dusenbery called these causal inputs.
Other inputs (information) are important only because they are associated with causal inputs and can be used to predict the occurrence of a causal input at a later time (and perhaps another place).
Some information is important because of association with other information but eventually there must be a connection to a causal input.
In practice, information is usually carried by weak stimuli that must be detected by specialized sensory systems and amplified by energy inputs before they can be functional to the organism or device.
For example, light is often a causal input to plants but provides information to animals.
The colored light reflected from a flower is too weak to do much photosynthetic work but the visual system of the bee detects it and the bee's nervous system uses the information to guide the bee to the flower, where the bee often finds nectar or pollen, which are causal inputs, serving a nutritional function.
Information is any type of sensory input.
When an organism with a nervous system receives an input, it transforms the input into an electrical signal.
Some regard this signal as information.
The idea of representation is still relevant, but in a slightly different manner.
That is, while abstract painting does not represent anything concretely, when the viewer sees the painting, it is nevertheless transformed into electrical signals that create a representation of the painting.
Defined this way, information does not have to be related to truth, communication, or representation of an object.
Entertainment in general is not intended to be informative.
Music, the performing arts, amusement parks, works of fiction and so on are thus forms of information in this sense, but they are not necessarily forms of information according to some definitions given above.
Consider another example: food supplies both nutrition and taste for those who eat it.
If information is equated to sensory input, then nutrition is not information but taste is.
Information as an influence which leads to a transformation
Information is any type of pattern that influences the formation or transformation of other patterns.
In this sense, there is no need for a conscious mind to perceive, much less appreciate, the pattern.
Consider, for example, DNA.
The sequence of nucleotides is a pattern that influences the formation and development of an organism without any need for a conscious mind.
Systems theory at times seems to refer to information in this sense, assuming information does not necessarily involve any conscious mind, and patterns circulating (due to feedback) in the system can be called information.
In other words, it can be said that information in this sense is something potentially perceived as representation, though not created or presented for that purpose.
When Marshall McLuhan speaks of media and their effects on human cultures, he refers to the structure of artifacts that in turn shape our behaviors and mindsets.
Also, pheromones are often said to be "information" in this sense.
(See also Gregory Bateson.)
Information as a property in physics
In 2003, J. D. Bekenstein claimed there is a growing trend in physics to define the physical world as being made of information itself (and thus information is defined in this way).
Information has a well defined meaning in physics.
Examples of this include the phenomenon of quantum entanglement where particles can interact without reference to their separation or the speed of light.
Information itself cannot travel faster than light even if the information is transmitted indirectly.
This could lead to the fact that all attempts at physically observing a particle with an "entangled" relationship to another are slowed down, even though the particles are not connected in any other way other than by the information they carry.
Another link is demonstrated by the Maxwell's demon thought experiment.
In this experiment, a direct relationship between information and another physical property, entropy, is demonstrated.
A consequence is that it is impossible to destroy information without increasing the entropy of a system; in practical terms this often means generating heat.
Another, more philosophical, outcome is that information could be thought of as interchangeable with energy.
Thus, in the study of logic gates, the theoretical lower bound of thermal energy released by an AND gate is higher than for the NOT gate (because information is destroyed in an AND gate and simply converted in a NOT gate).
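The comparison between the two gates can be made concrete with a small calculation. The sketch below assumes the bound usually known as Landauer's principle (at least k T ln 2 of heat per erased bit), assumes uniformly random and independent gate inputs, and measures the information destroyed as input entropy minus output entropy. The function names and the 300 K temperature are illustrative assumptions; real gates dissipate far more than this theoretical minimum.

```python
import math
from collections import Counter
from itertools import product

BOLTZMANN = 1.380649e-23  # Boltzmann constant in joules per kelvin

def entropy_bits(outcomes):
    """Entropy in bits of the empirical distribution over a list of outcomes."""
    counts = Counter(outcomes)
    total = len(outcomes)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def erased_bits(gate, n_inputs):
    """Input entropy minus output entropy, assuming uniform random inputs."""
    inputs = list(product([0, 1], repeat=n_inputs))
    outputs = [gate(*bits) for bits in inputs]
    return entropy_bits(inputs) - entropy_bits(outputs)

def landauer_bound_joules(bits, temperature_kelvin=300.0):
    """Minimum heat released when the given number of bits is erased."""
    return bits * BOLTZMANN * temperature_kelvin * math.log(2)

and_loss = erased_bits(lambda a, b: a & b, 2)  # several inputs map to the same output
not_loss = erased_bits(lambda a: 1 - a, 1)     # one-to-one, nothing is erased

and_heat = landauer_bound_joules(and_loss)
not_heat = landauer_bound_joules(not_loss)
```

Under these assumptions the AND gate destroys a little over one bit per operation and therefore has a non-zero lower bound on released heat, while the NOT gate, being reversible, has a bound of zero.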
Physical information is of particular importance in the theory of quantum computers.
Information as records
Records are a specialized form of information.
Essentially, records are information produced consciously or as by-products of business activities or transactions and retained because of their value.
Primarily their value is as evidence of the activities of the organization but they may also be retained for their informational value.
Sound records management ensures that the integrity of records is preserved for as long as they are required.
The international standard on records management, ISO 15489, defines records as "information created, received, and maintained as evidence and information by an organization or person, in pursuance of legal obligations or in the transaction of business".
The International Council on Archives (ICA) Committee on Electronic Records defined a record as "a specific piece of recorded information generated, collected or received in the initiation, conduct or completion of an activity and that comprises sufficient content, context and structure to provide proof or evidence of that activity".
Records may be retained because of their business value, as part of the corporate memory of the organization or to meet legal, fiscal or accountability requirements imposed on the organization.
Willis (2005) expressed the view that sound management of business records and information delivered "…six key requirements for good corporate governance…transparency; accountability; due process; compliance; meeting statutory and common law requirements; and security of personal and corporate information."
Information and semiotics
Beynon-Davies explains the multi-faceted concept of information in terms of that of signs and sign-systems.
Signs themselves can be considered in terms of four inter-dependent levels, layers or branches of semiotics: pragmatics, semantics, syntactics and empirics.
These four layers serve to connect the social world on the one hand with the physical or technical world on the other.
Pragmatics is concerned with the purpose of communication.
Pragmatics links the issue of signs with that of intention.
The focus of pragmatics is on the intentions of human agents underlying communicative behaviour.
In other words, intentions link language to action.
Semantics is concerned with the meaning of a message conveyed in a communicative act.
Semantics considers the content of communication.
Semantics is the study of the meaning of signs - the association between signs and behaviour.
Semantics can be considered as the study of the link between symbols and their referents or concepts; particularly the way in which signs relate to human behaviour.
Syntactics is concerned with the formalism used to represent a message.
Syntactics as an area studies the form of communication in terms of the logic and grammar of sign systems.
Syntactics is devoted to the study of the form rather than the content of signs and sign-systems.
Empirics is the study of the signals used to carry a message; the physical characteristics of the medium of communication.
Empirics is devoted to the study of communication channels and their characteristics, e.g., sound, light, electronic transmission etc.
Communication normally exists within the context of some social situation.
The social situation sets the context for the intentions conveyed (pragmatics) and the form in which communication takes place.
In a communicative situation intentions are expressed through messages which comprise collections of inter-related signs taken from a language which is mutually understood by the agents involved in the communication.
Mutual understanding implies that agents involved understand the chosen language in terms of its agreed syntax (syntactics) and semantics.
The sender codes the message in the language and sends the message as signals along some communication channel (empirics).
The chosen communication channel will have inherent properties which determine outcomes such as the speed with which communication can take place and over what distance.
Information extraction
In natural language processing, information extraction (IE) is a type of information retrieval whose goal is to automatically extract structured information, i.e. categorized and contextually and semantically well-defined data from a certain domain, from unstructured machine-readable documents.
An example of information extraction is the extraction of instances of corporate mergers from an online news sentence such as: "Yesterday, New-York based Foo Inc. announced their acquisition of Bar Corp."
A broad goal of IE is to allow computation to be done on the previously unstructured data.
A more specific goal is to allow logical reasoning to draw inferences based on the logical content of the input data.
The significance of IE is determined by the growing amount of information available in unstructured (i.e. without metadata) form, for instance on the Internet.
This knowledge can be made more accessible by means of transformation into relational form, or by marking-up with XML tags.
An intelligent agent monitoring a news data feed requires IE to transform unstructured data into something that can be reasoned with.
A typical application of IE is to scan a set of documents written in a natural language and populate a database with the information extracted.
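The scan-and-populate application can be sketched with a toy example. The code below is only an illustration: it uses a single hand-written regular expression rather than real natural language processing, and the pattern, the table name acquisition and the function names are all hypothetical. It extracts the acquirer and target from the example news sentence and stores them in an in-memory relational table.

```python
import re
import sqlite3

# Hypothetical pattern: capitalized company names around "announced ... acquisition of".
ACQUISITION = re.compile(
    r"(?P<acquirer>[A-Z]\w+(?: [A-Z]\w+\.?)*) announced (?:its|their) "
    r"acquisition of (?P<target>[A-Z]\w+(?: [A-Z]\w+\.?)*)"
)

def extract_acquisitions(text):
    """Return (acquirer, target) pairs found in unstructured text."""
    return [(m.group("acquirer"), m.group("target")) for m in ACQUISITION.finditer(text)]

def populate(connection, documents):
    """Scan each document and store the extracted pairs in a relational table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS acquisition (acquirer TEXT, target TEXT)"
    )
    for doc in documents:
        connection.executemany(
            "INSERT INTO acquisition VALUES (?, ?)", extract_acquisitions(doc)
        )
    connection.commit()

news = ["Yesterday, New-York based Foo Inc. announced their acquisition of Bar Corp."]
db = sqlite3.connect(":memory:")
populate(db, news)
rows = db.execute("SELECT acquirer, target FROM acquisition").fetchall()
```

Once the data is in relational form, ordinary queries can be run over facts that were previously buried in free text. A practical system would need far more robust entity recognition than this single pattern provides.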
Current approaches to IE use natural language processing techniques that focus on very restricted domains.
For example, the Message Understanding Conference (MUC) is a competition-based conference that focused on the following domains in the past:
MUC-1 (1987), MUC-2 (1989): Naval operations messages.
MUC-3 (1991), MUC-4 (1992): Terrorism in Latin American countries.
MUC-5 (1993): Joint ventures and microelectronics domain.
MUC-6 (1995): News articles on management changes.
MUC-7 (1998): Satellite launch reports.
Natural language texts may first need some form of text simplification to produce more easily machine-readable text from which sentences can be extracted.
Typical subtasks of IE are:
Named Entity Recognition: recognition of entity names (for people and organizations), place names, temporal expressions, and certain types of numerical expressions.
Coreference: identification of chains of noun phrases that refer to the same object.
For example, anaphora is a type of coreference.
Terminology extraction: finding the relevant terms for a given corpus
Relation Extraction: identification of relations between entities, such as:
PERSON works for ORGANIZATION (extracted from the sentence "Bill works for IBM.")
PERSON located in LOCATION (extracted from the sentence "Bill is in France.")
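The relation extraction subtask can be illustrated with a deliberately naive sketch. It assumes that entities are single capitalized words and that each relation is signalled by a fixed phrase; the pattern table and function name are hypothetical, and a real system would rely on named entity recognition rather than capitalization.

```python
import re

# Hypothetical surface patterns, one per relation type listed above.
PATTERNS = {
    "PERSON works for ORGANIZATION": re.compile(r"\b([A-Z]\w*) works for ([A-Z]\w*)"),
    "PERSON located in LOCATION": re.compile(r"\b([A-Z]\w*) is in ([A-Z]\w*)"),
}

def extract_relations(text):
    """Return (relation, subject, object) triples matched by the toy patterns."""
    found = []
    for relation, pattern in PATTERNS.items():
        for subject, obj in pattern.findall(text):
            found.append((relation, subject, obj))
    return found

triples = extract_relations("Bill works for IBM. Bill is in France.")
```

For the two example sentences this yields one triple per relation, linking Bill to IBM and Bill to France.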
Information retrieval
Information retrieval (IR) is the science of searching for documents, for information within documents and for metadata about documents, as well as that of searching relational databases and the World Wide Web.
There is overlap in the usage of the terms data retrieval, document retrieval, information retrieval, and text retrieval, but each also has its own body of literature, theory, praxis and technologies.
IR is interdisciplinary, based on computer science, mathematics, library science, information science, information architecture, cognitive psychology, linguistics, statistics and physics.
Automated information retrieval systems are used to reduce what has been called "information overload".
Many universities and public libraries use IR systems to provide access to books, journals and other documents.
Web search engines are the most visible IR applications.
History
The idea of using computers to search for relevant pieces of information was popularized in an article As We May Think by Vannevar Bush in 1945.
First implementations of information retrieval systems were introduced in the 1950s and 1960s.
By 1990 several different techniques had been shown to perform well on small text corpora (several thousand documents).
In 1992 the US Department of Defense, along with the National Institute of Standards and Technology (NIST), cosponsored the Text Retrieval Conference (TREC) as part of the TIPSTER text program.
The aim was to support the information retrieval community by supplying the infrastructure needed to evaluate text retrieval methodologies on a very large text collection.
This catalyzed research on methods that scale to huge corpora.
The introduction of web search engines has boosted the need for very large scale retrieval systems even further.
The use of digital methods for storing and retrieving information has led to the phenomenon of digital obsolescence, where a digital resource ceases to be readable because the physical media, the reader required to read the media, the hardware, or the software that runs on it, is no longer available.
The information is initially easier to retrieve than if it were on paper, but is then effectively lost.
Timeline
1890: Herman Hollerith's tabulating machines were used to analyze the US census.
1945: Vannevar Bush's As We May Think appeared in Atlantic Monthly
Late 1940s: The US military confronted problems of indexing and retrieval of wartime scientific research documents captured from Germans.
1947: Hans Peter Luhn (research engineer at IBM since 1941) began work on a mechanized, punch card based system for searching chemical compounds.
1950: The term "information retrieval" may have been coined by Calvin Mooers.
1950s: Growing concern in the US for a "science gap" with the USSR motivated, encouraged funding, and provided a backdrop for mechanized literature searching systems (Allen Kent et al) and the invention of citation indexing (Eugene Garfield).
1955: Allen Kent joined Case Western Reserve University, and eventually became associate director of the Center for Documentation and Communications Research.
That same year, Kent and colleagues published a paper in American Documentation describing the precision and recall measures, as well as detailing a proposed "framework" for evaluating an IR system, which included statistical sampling methods for determining the number of relevant documents not retrieved.
1958: International Conference on Scientific Information Washington DC included consideration of IR systems as a solution to problems identified.
See: Proceedings of the International Conference on Scientific Information, 1958 (National Academy of Sciences, Washington, DC, 1959)
1959: Hans Peter Luhn published "Auto-encoding of documents for information retrieval."
1960: Melvin Earl (Bill) Maron and J. L. Kuhns published "On relevance, probabilistic indexing, and information retrieval" in Journal of the ACM 7(3):216-244, July 1960.
Early 1960s: Gerard Salton began work on IR at Harvard, later moved to Cornell.
1962: Cyril W. Cleverdon published early findings of the Cranfield studies, developing a model for IR system evaluation.
See: Cyril W. Cleverdon, "Report on the Testing and Analysis of an Investigation into the Comparative Efficiency of Indexing Systems".
Cranfield Coll. of Aeronautics, Cranfield, England, 1962.
1962: Kent published Information Analysis and Retrieval
1963: Weinberg report "Science, Government and Information" gave a full articulation of the idea of a "crisis of scientific information."
The report was named after Dr. Alvin Weinberg.
1963: Joseph Becker and Robert M. Hayes published a text on information retrieval.
Becker, Joseph; Hayes, Robert Mayo. Information storage and retrieval: tools, elements, theories. New York, Wiley (1963).
1964: Karen Spärck Jones finished her thesis at Cambridge, Synonymy and Semantic Classification, and continued work on computational linguistics as it applies to IR
1964: The National Bureau of Standards sponsored a symposium titled "Statistical Association Methods for Mechanized Documentation."
The symposium included several highly significant papers, among them what is believed to be G. Salton's first published reference to the SMART system.
Mid-1960s: National Library of Medicine developed MEDLARS (Medical Literature Analysis and Retrieval System), the first major machine-readable database and batch retrieval system
Mid-1960s: Project Intrex at MIT
1965: J. C. R. Licklider published Libraries of the Future
1966: Don Swanson was involved in studies at University of Chicago on Requirements for Future Catalogs
1968: Gerard Salton published Automatic Information Organization and Retrieval.
1968: J. W. Sammon's RADC Tech report "Some Mathematics of Information Storage and Retrieval..." outlined the vector model.
1969: Sammon's "A nonlinear mapping for data structure analysis" (IEEE Transactions on Computers) was the first proposal for visualization interface to an IR system.
Late 1960s: F. W. Lancaster completed evaluation studies of the MEDLARS system and published the first edition of his text on information retrieval
Early 1970s: first online systems--NLM's AIM-TWX, MEDLINE; Lockheed's Dialog; SDC's ORBIT
Early 1970s: Theodor Nelson promoting concept of hypertext, published Computer Lib/Dream Machines
1971: N. Jardine and C. J. Van Rijsbergen published "The use of hierarchic clustering in information retrieval", which articulated the "cluster hypothesis."
(Information Storage and Retrieval, 7(5), pp. 217-240, Dec 1971)
1975: Three highly influential publications by Salton fully articulated his vector processing framework and term discrimination model:
A Theory of Indexing (Society for Industrial and Applied Mathematics)
"A theory of term importance in automatic text analysis", (JASIS v. 26)
"A vector space model for automatic indexing", (CACM 18:11)
1978: The First ACM SIGIR conference.
1979: C. J. Van Rijsbergen published Information Retrieval (Butterworths).
Heavy emphasis on probabilistic models.
1980: First international ACM SIGIR conference, joint with British Computer Society IR group in Cambridge
1982: Belkin, Oddy, and Brooks proposed the ASK (Anomalous State of Knowledge) viewpoint for information retrieval.
This was an important concept, though their automated analysis tool proved ultimately disappointing.
1983: Salton (and M. McGill) published Introduction to Modern Information Retrieval (McGraw-Hill), with heavy emphasis on vector models.
Mid-1980s: Efforts to develop end user versions of commercial IR systems.
1985-1993: Key papers on and experimental systems for visualization interfaces.
Work by D. B. Crouch, Robert R. Korfhage, M. Chalmers, A. Spoerri and others.
1989: First World Wide Web proposals by Tim Berners-Lee at CERN.
1992: First TREC conference.
1997: Publication of Korfhage's Information Storage and Retrieval with emphasis on visualization and multi-reference point systems.
Late 1990s: Web search engine implementation of many features formerly found only in experimental IR systems
Overview
An information retrieval process begins when a user enters a query into the system.
Queries are formal statements of information needs, for example search strings in web search engines.
In information retrieval a query does not uniquely identify a single object in the collection.
Instead, several objects may match the query, perhaps with different degrees of relevancy.
An object is an entity which keeps or stores information in a database.
User queries are matched to objects stored in the database.
Depending on the application the data objects may be, for example, text documents, images or videos.
Often the documents themselves are not kept or stored directly in the IR system, but are instead represented in the system by document surrogates.
Most IR systems compute a numeric score for how well each object in the database matches the query, and rank the objects according to this value.
The top ranking objects are then shown to the user.
The process may then be iterated if the user wishes to refine the query.
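The score-then-rank loop described above can be sketched in a few lines of Python. This is only an illustration: the scoring function (a count of query-term occurrences), the names `score` and `rank`, and the three-document collection are assumptions made for the example, not a description of any particular IR system.

```python
from collections import Counter

def score(query: str, document: str) -> int:
    """Toy relevance score: total occurrences of the query terms in the document."""
    doc_terms = Counter(document.lower().split())
    return sum(doc_terms[term] for term in set(query.lower().split()))

def rank(query: str, collection: dict, top_k: int = 3) -> list:
    """Score every document, drop non-matching ones, return the top_k by score."""
    scored = [(doc_id, score(query, text)) for doc_id, text in collection.items()]
    matching = [(doc_id, s) for doc_id, s in scored if s > 0]
    # Highest score first; ties are broken by document id for a stable order.
    return sorted(matching, key=lambda pair: (-pair[1], pair[0]))[:top_k]

# Hypothetical collection used only for this example.
collection = {
    "d1": "information retrieval systems rank documents",
    "d2": "a recipe for tomato soup",
    "d3": "retrieval of information from large document collections",
}
results = rank("information retrieval", collection)
```

Several documents match the same query here, which is the point made above: a query does not identify a single object, and the ranking decides what the user sees first.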
Performance measures
Many different measures for evaluating the performance of information retrieval systems have been proposed.
The measures require a collection of documents and a query.
All common measures described here assume a ground truth notion of relevancy: every document is known to be either relevant or non-relevant to a particular query.
In practice queries may be ill-posed and there may be different shades of relevancy.
Precision
Precision is the fraction of the documents retrieved that are relevant to the user's information need.
In binary classification, precision is analogous to positive predictive value.
Precision takes all retrieved documents into account.
It can also be evaluated at a given cut-off rank, considering only the topmost results returned by the system.
This measure is called precision at n or P@n.
Note that the meaning and usage of "precision" in the field of Information Retrieval differs from the definition of accuracy and precision within other branches of science and technology.
Recall
Recall is the fraction of the documents that are relevant to the query that are successfully retrieved.
In binary classification, recall is called sensitivity.
So it can be looked at as the probability that a relevant document is retrieved by the query.
It is trivial to achieve recall of 100% by returning all documents in response to any query.
Therefore recall alone is not enough but one needs to measure the number of non-relevant documents also, for example by computing the precision.
Fall-Out
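The proportion of non-relevant documents that are retrieved, out of all non-relevant documents available:
$$\text{fall-out} = \frac{|\{\text{non-relevant documents}\} \cap \{\text{retrieved documents}\}|}{|\{\text{non-relevant documents}\}|}$$
In binary classification, fall-out is closely related to specificity.
More precisely: $\text{fall-out} = 1 - \text{specificity}$.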
It can be looked at as the probability that a non-relevant document is retrieved by the query.
It is trivial to achieve fall-out of 0% by returning zero documents in response to any query.
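As a minimal sketch, precision, recall and fall-out can be computed directly from sets of document identifiers. The function names and the ten-document collection below are assumptions made for illustration; the zero-denominator conventions (returning 0.0) are also a choice of this sketch.

```python
def precision(relevant: set, retrieved: set) -> float:
    """Fraction of retrieved documents that are relevant."""
    return len(relevant & retrieved) / len(retrieved) if retrieved else 0.0

def recall(relevant: set, retrieved: set) -> float:
    """Fraction of relevant documents that were retrieved."""
    return len(relevant & retrieved) / len(relevant) if relevant else 0.0

def fall_out(relevant: set, retrieved: set, collection: set) -> float:
    """Fraction of non-relevant documents that were retrieved."""
    non_relevant = collection - relevant
    return len(non_relevant & retrieved) / len(non_relevant) if non_relevant else 0.0

# Hypothetical ground truth: ten documents, four of them relevant.
collection = set(range(1, 11))
relevant = {1, 2, 3, 4}
retrieved = {1, 2, 5}

p = precision(relevant, retrieved)            # 2 of 3 retrieved are relevant
r = recall(relevant, retrieved)               # 2 of 4 relevant were retrieved
f = fall_out(relevant, retrieved, collection) # 1 of 6 non-relevant was retrieved

# Returning everything gives perfect recall but poor precision and fall-out.
r_all = recall(relevant, collection)
p_all = precision(relevant, collection)
f_all = fall_out(relevant, collection, collection)
```

The last three lines reproduce the trivial case noted above: retrieving the whole collection drives recall to 100% while precision drops to the base rate of relevant documents.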
F-measure
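The weighted harmonic mean of precision and recall, the traditional F-measure or balanced F-score is:
$$F = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$
This is also known as the $F_1$ measure, because recall and precision are evenly weighted.
The general formula for non-negative real $\beta$ is:
$$F_\beta = \frac{(1 + \beta^2) \cdot \text{precision} \cdot \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}}$$
Two other commonly used F measures are the $F_2$ measure, which weights recall twice as much as precision, and the $F_{0.5}$ measure, which weights precision twice as much as recall.
The F-measure was derived by van Rijsbergen (1979) so that $F_\beta$ "measures the effectiveness of retrieval with respect to a user who attaches $\beta$ times as much importance to recall as precision".
It is based on van Rijsbergen's effectiveness measure $E = 1 - \frac{1}{\frac{\alpha}{P} + \frac{1 - \alpha}{R}}$, where $P$ is precision and $R$ is recall.
Their relationship is $F_\beta = 1 - E$ where $\alpha = \frac{1}{1 + \beta^2}$.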
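The following sketch computes the F-measure for a given β and checks numerically that it equals 1 − E when α = 1/(1 + β²), where E is van Rijsbergen's effectiveness measure. The function names and the sample precision and recall values are illustrative assumptions.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall; beta > 1 favours recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def effectiveness(precision: float, recall: float, alpha: float) -> float:
    """van Rijsbergen's E measure; assumes precision and recall are both non-zero."""
    return 1 - 1 / (alpha / precision + (1 - alpha) / recall)

# Hypothetical system with better recall than precision.
P, R = 0.5, 0.8
f1 = f_beta(P, R)             # balanced
f2 = f_beta(P, R, beta=2.0)   # recall weighted more heavily
f05 = f_beta(P, R, beta=0.5)  # precision weighted more heavily

# Relationship between F and E for beta = 2, i.e. alpha = 1 / (1 + 4).
alpha = 1 / (1 + 2.0 ** 2)
one_minus_e = 1 - effectiveness(P, R, alpha)
```

Because recall exceeds precision in this example, the recall-weighted score is the highest of the three and the precision-weighted score the lowest.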
Average precision
The precision and recall are based on the whole list of documents returned by the system.
Average precision emphasizes returning more relevant documents earlier.
It is the average of the precisions computed after truncating the list after each of the relevant documents in turn:
$$\operatorname{AveP} = \frac{\sum_{r=1}^{N} P(r) \cdot \operatorname{rel}(r)}{\text{number of relevant documents}}$$
where r is the rank, N the number retrieved, rel() a binary function on the relevance of a given rank, and P() precision at a given cut-off rank.
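A minimal sketch of average precision follows, taking the ranked result list as a sequence of 0/1 relevance flags. The function name, the optional `num_relevant` argument, and the two example rankings are assumptions made for illustration.

```python
def average_precision(ranked_relevance, num_relevant=None) -> float:
    """Average of P(r) over the ranks r at which a relevant document appears.

    ranked_relevance: 0/1 flags in rank order (1 = relevant).
    num_relevant: total relevant documents; defaults to those in the list.
    """
    if num_relevant is None:
        num_relevant = sum(ranked_relevance)
    if num_relevant == 0:
        return 0.0
    hits = 0
    total = 0.0
    for r, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            total += hits / r  # precision at cut-off rank r
    return total / num_relevant

# Two hypothetical rankings that retrieve the same two relevant documents.
early = average_precision([1, 0, 1, 0, 0])  # relevant at ranks 1 and 3
late = average_precision([0, 0, 1, 0, 1])   # relevant at ranks 3 and 5
```

Both rankings have the same precision and recall over the full list, but the one that places relevant documents earlier receives the higher average precision.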
Model types
Figure: categorization of IR models (translated from German entry, original source Dominik Kuropka).
For the information retrieval to be efficient, the documents are typically transformed into a suitable representation.
There are several representations.
The picture on the right illustrates the relationship of some common models.
In the picture, the models are categorized according to two dimensions: the mathematical basis and the properties of the model.
First dimension: mathematical basis
Set-theoretic models represent documents as sets of words or phrases.
Similarities are usually derived from set-theoretic operations on those sets.
Common models are:
Standard Boolean model
Extended Boolean model
Fuzzy retrieval
Algebraic models represent documents and queries usually as vectors, matrices or tuples.
The similarity of the query vector and document vector is represented as a scalar value.
Vector space model
Generalized vector space model
Topic-based vector space model (literature: www.kuropka.net/files/TVSM.pdf, www.logos-verlag.de/cgi-bin/engbuchmid?isbn=0514&lng=eng&id=)
Extended Boolean model
Enhanced topic-based vector space model (literature: kuropka.net/files/HPI_Evaluation_of_eTVSM.pdf, www.logos-verlag.de/cgi-bin/engbuchmid?isbn=0514&lng=eng&id=)
Latent semantic indexing aka latent semantic analysis
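As an illustration of the algebraic family, the sketch below represents a query and documents as term-count vectors and uses the cosine of the angle between them as the scalar similarity. Raw term counts, whitespace tokenization, and the helper names `to_vector` and `cosine` are simplifying assumptions; practical vector space systems typically apply term weighting as well.

```python
import math
from collections import Counter

def to_vector(text: str) -> Counter:
    """Sparse term-count vector keyed by lower-cased token."""
    return Counter(text.lower().split())

def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical query and documents.
query = to_vector("information retrieval")
doc_a = to_vector("information retrieval information")
doc_b = to_vector("tomato soup recipe")

sim_a = cosine(query, doc_a)
sim_b = cosine(query, doc_b)
```

Documents sharing no terms with the query get similarity 0, while documents whose term proportions resemble the query's approach 1.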
Probabilistic models treat the process of document retrieval as a probabilistic inference.
Similarities are computed as probabilities that a document is relevant for a given query.
Probabilistic theorems such as Bayes' theorem are often used in these models.
Binary independence retrieval
Probabilistic relevance model (BM25)
Uncertain inference
Language models
Divergence-from-randomness model
Latent Dirichlet allocation
Second dimension: properties of the model
Models without term-interdependencies treat different terms/words as independent.
This fact is usually represented in vector space models by the orthogonality assumption of term vectors or in probabilistic models by an independence assumption for term variables.
Models with immanent term interdependencies allow a representation of interdependencies between terms.
However the degree of the interdependency between two terms is defined by the model itself.
It is usually directly or indirectly derived (e.g. by dimensional reduction) from the co-occurrence of those terms in the whole set of documents.
Models with transcendent term interdependencies allow a representation of interdependencies between terms, but they do not specify how the interdependency between two terms is defined.
They rely on an external source for the degree of interdependency between two terms (for example, a human or a sophisticated algorithm).
Major figures
Gerard Salton
Hans Peter Luhn
W. Bruce Croft
Karen Spärck Jones
C. J. van Rijsbergen
Stephen E. Robertson
Awards in the field
Tony Kent Strix award
Gerard Salton Award
Information theory
Information theory is a branch of applied mathematics and electrical engineering involving the quantification of information.
Historically, information theory was developed to find fundamental limits on compressing and reliably communicating data.
Since its inception it has broadened to find applications in many other areas, including statistical inference, natural language processing, cryptography, networks other than communication networks (as in neurobiology), the evolution and function of molecular codes, model selection in ecology, thermal physics, quantum computing, plagiarism detection and other forms of data analysis.
A key measure of information in the theory is known as information entropy, which is usually expressed by the average number of bits needed for storage or communication.
Intuitively, entropy quantifies the uncertainty involved when encountering a random variable.
For example, a fair coin flip (2 equally likely outcomes) will have less entropy than a roll of a die (6 equally likely outcomes).
Applications of fundamental topics of information theory include lossless data compression (e.g. ZIP files), lossy data compression (e.g. MP3s), and channel coding (e.g. for DSL lines).
The field is at the intersection of mathematics, statistics, computer science, physics, neurobiology, and electrical engineering.
Its impact has been crucial to the success of the Voyager missions to deep space, the invention of the CD, the feasibility of mobile phones, the development of the Internet, the study of linguistics and of human perception, the understanding of black holes, and numerous other fields.
Important sub-fields of information theory are source coding, channel coding, algorithmic complexity theory, algorithmic information theory, and measures of information.
Overview
The main concepts of information theory can be grasped by considering the most widespread means of human communication: language.
Two important aspects of a good language are as follows: First, the most common words (e.g., "a", "the", "I") should be shorter than less common words (e.g., "benefit", "generation", "mediocre"), so that sentences will not be too long.
Such a tradeoff in word length is analogous to data compression and is the essential aspect of source coding.
Second, if part of a sentence is unheard or misheard due to noise (e.g., a passing car), the listener should still be able to glean the meaning of the underlying message.
Such robustness is as essential for an electronic communication system as it is for a language; properly building such robustness into communications is done by channel coding.
Source coding and channel coding are the fundamental concerns of information theory.
Note that these concerns have nothing to do with the importance of messages.
For example, a platitude such as "Thank you; come again" takes about as long to say or write as the urgent plea, "Call an ambulance!" while clearly the latter is more important and more meaningful.
Information theory, however, does not consider message importance or meaning, as these are matters of the quality of data rather than the quantity and readability of data, the latter of which is determined solely by probabilities.
Information theory is generally considered to have been founded in 1948 by Claude Shannon in his seminal work, "A Mathematical Theory of Communication."
The central paradigm of classical information theory is the engineering problem of the transmission of information over a noisy channel.
The most fundamental results of this theory are Shannon's source coding theorem, which establishes that, on average, the number of bits needed to represent the result of an uncertain event is given by its entropy; and Shannon's noisy-channel coding theorem, which states that reliable communication is possible over noisy channels provided that the rate of communication is below a certain threshold called the channel capacity.
The channel capacity can be approached in practice by using appropriate encoding and decoding systems.
Information theory is closely associated with a collection of pure and applied disciplines that have been investigated and reduced to engineering practice under a variety of rubrics throughout the world over the past half century or more: adaptive systems, anticipatory systems, artificial intelligence, complex systems, complexity science, cybernetics, informatics, machine learning, along with systems sciences of many descriptions.
Information theory is a broad and deep mathematical theory, with equally broad and deep applications, amongst which is the vital field of coding theory.
Coding theory is concerned with finding explicit methods, called codes, of increasing the efficiency and reducing the net error rate of data communication over a noisy channel to near the limit that Shannon proved is the maximum possible for that channel.
These codes can be roughly subdivided into data compression (source coding) and error-correction (channel coding) techniques.
In the latter case, it took many years to find the methods Shannon's work proved were possible.
A third class of information theory codes are cryptographic algorithms (both codes and ciphers).
Concepts, methods and results from coding theory and information theory are widely used in cryptography and cryptanalysis.
See the article ban (information) for a historical application.
Information theory is also used in information retrieval, intelligence gathering, gambling, statistics, and even in musical composition.
Historical background
The landmark event that established the discipline of information theory, and brought it to immediate worldwide attention, was the publication of Claude E. Shannon's classic paper "A Mathematical Theory of Communication" in the Bell System Technical Journal in July and October of 1948.
Prior to this paper, limited information theoretic ideas had been developed at Bell Labs, all implicitly assuming events of equal probability.
Harry Nyquist's 1924 paper, Certain Factors Affecting Telegraph Speed, contains a theoretical section quantifying "intelligence" and the "line speed" at which it can be transmitted by a communication system, giving the relation $W = K \log m$, where W is the speed of transmission of intelligence, m is the number of different voltage levels to choose from at each time step, and K is a constant.
Ralph Hartley's 1928 paper, Transmission of Information, uses the word information as a measurable quantity, reflecting the receiver's ability to distinguish that one sequence of symbols from any other, thus quantifying information as $H = \log S^n = n \log S$, where S was the number of possible symbols, and n the number of symbols in a transmission.
The natural unit of information was therefore the decimal digit, much later renamed the hartley in his honour as a unit or scale or measure of information.
Alan Turing in 1940 used similar ideas as part of the statistical analysis of the breaking of the German Enigma ciphers of the Second World War.
Much of the mathematics behind information theory with events of different probabilities was developed for the field of thermodynamics by Ludwig Boltzmann and J. Willard Gibbs.
Connections between information-theoretic entropy and thermodynamic entropy, including the important contributions by Rolf Landauer in the 1960s, are explored in Entropy in thermodynamics and information theory.
In Shannon's revolutionary and groundbreaking paper, the work for which had been substantially completed at Bell Labs by the end of 1944, Shannon for the first time introduced the qualitative and quantitative model of communication as a statistical process underlying information theory, opening with the assertion that
"The fundamental problem of communication is that of reproducing at one point, either exactly or approximately, a message selected at another point."
With it came the ideas of
the information entropy and redundancy of a source, and its relevance through the source coding theorem;
the mutual information, and the channel capacity of a noisy channel, including the promise of perfect loss-free communication given by the noisy-channel coding theorem;
the practical result of the Shannon–Hartley law for the channel capacity of a Gaussian channel; and of course
the bit—a new way of seeing the most fundamental unit of information
Ways of measuring information
Information theory is based on probability theory and statistics.
The most important quantities of information are entropy, the information in a random variable, and mutual information, the amount of information in common between two random variables.
The former quantity indicates how easily message data can be compressed while the latter can be used to find the communication rate across a channel.
The choice of logarithmic base in the following formulae determines the unit of information entropy that is used.
The most common unit of information is the bit, based on the binary logarithm.
Other units include the nat, which is based on the natural logarithm, and the hartley, which is based on the common logarithm.
In what follows, an expression of the form $p \log p$ is considered by convention to be equal to zero whenever $p = 0$.
This is justified because $\lim_{p \to 0^+} p \log p = 0$ for any logarithmic base.
Entropy
The entropy, $H$, of a discrete random variable $X$ is a measure of the amount of uncertainty associated with the value of $X$.
Suppose one transmits 1000 bits (0s and 1s).
If these bits are known ahead of transmission (to be a certain value with absolute probability), logic dictates that no information has been transmitted.
If, however, each is equally and independently likely to be 0 or 1, 1000 bits (in the information theoretic sense) have been transmitted.
Between these two extremes, information can be quantified as follows.
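If $\mathbb{X}$ is the set of all messages $\{x_1, \ldots, x_n\}$ that $X$ could be, and $p(x)$ is the probability of $X$ taking the value $x \in \mathbb{X}$, then the entropy of $X$ is defined:
$$H(X) = \mathbb{E}_X[I(x)] = -\sum_{x \in \mathbb{X}} p(x) \log p(x)$$
(Here, $I(x)$ is the self-information, which is the entropy contribution of an individual message, and $\mathbb{E}_X$ is the expected value.)
An important property of entropy is that it is maximized when all the messages in the message space are equiprobable (i.e., most unpredictable), with $p(x) = 1/n$, in which case $H(X) = \log n$.
The special case of information entropy for a random variable with two outcomes is the binary entropy function:
$$H_b(p) = -p \log p - (1 - p) \log (1 - p)$$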
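A short sketch of the entropy definition in bits (base-2 logarithm), using the convention that terms with zero probability contribute nothing. The function names are illustrative; the examples are the fair coin and fair die mentioned earlier in the article, plus the binary entropy function.

```python
import math

def entropy(probs) -> float:
    """Shannon entropy in bits; terms with p == 0 are skipped by convention."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def binary_entropy(p: float) -> float:
    """Entropy in bits of a two-outcome variable with probabilities p and 1 - p."""
    return entropy([p, 1 - p])

coin = entropy([0.5, 0.5])   # fair coin: log2(2) bits
die = entropy([1 / 6] * 6)   # fair die: log2(6) bits
certain = entropy([1.0])     # known outcome: no uncertainty
biased = binary_entropy(0.9) # biased coin: less than one bit
```

The equiprobable cases reach the maximum log n for n outcomes, while any bias lowers the entropy below that bound.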
Joint entropy
The joint entropy of two discrete random variables $X$ and $Y$ is merely the entropy of their pairing: $(X, Y)$.
This implies that if $X$ and $Y$ are independent, then their joint entropy is the sum of their individual entropies.
For example, if $(X, Y)$ represents the position of a chess piece ($X$ the row and $Y$ the column), then the joint entropy of the row of the piece and the column of the piece will be the entropy of the position of the piece.
Despite similar notation, joint entropy should not be confused with cross entropy.
Conditional entropy (equivocation)
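The conditional entropy or conditional uncertainty of $X$ given random variable $Y$ (also called the equivocation of $X$ about $Y$) is the average conditional entropy over $Y$:
$$H(X|Y) = \mathbb{E}_Y[H(X|y)] = -\sum_{y \in Y} p(y) \sum_{x \in X} p(x|y) \log p(x|y) = -\sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(y)}$$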
Because entropy can be conditioned on a random variable or on that random variable being a certain value, care should be taken not to confuse these two definitions of conditional entropy, the former of which is in more common use.
A basic property of this form of conditional entropy is that:
$$H(X|Y) = H(X,Y) - H(Y)$$
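The sketch below computes the conditional entropy of X given Y from a joint distribution and checks numerically that it equals the joint entropy minus the entropy of Y. The joint distribution, stored as a dictionary keyed by (x, y) pairs, and the helper names are assumptions made for illustration.

```python
import math
from collections import defaultdict

def entropy(dist: dict) -> float:
    """Shannon entropy in bits of an {outcome: probability} mapping."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def marginal_y(joint: dict) -> dict:
    """Marginal distribution of Y from a joint {(x, y): probability} mapping."""
    p_y = defaultdict(float)
    for (_, y), p in joint.items():
        p_y[y] += p
    return dict(p_y)

def conditional_entropy(joint: dict) -> float:
    """H(X|Y) from the definition: -sum over (x, y) of p(x,y) log2(p(x,y) / p(y))."""
    p_y = marginal_y(joint)
    return -sum(p * math.log2(p / p_y[y]) for (_, y), p in joint.items() if p > 0)

# Hypothetical joint distribution: X agrees with Y 80% of the time.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

h_x_given_y = conditional_entropy(joint)
chain_rule = entropy(joint) - entropy(marginal_y(joint))  # H(X,Y) - H(Y)
```

In this example knowing Y leaves the uncertainty of a coin with bias 0.8 about X, which is less than the full bit of uncertainty in X alone.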
Mutual information (transinformation)
Mutual information measures the amount of information that can be obtained about one random variable by observing another.
It is important in communication where it can be used to maximize the amount of information shared between sent and received signals.
The mutual information of $X$ relative to $Y$ is given by:
$$I(X;Y) = \mathbb{E}_{X,Y}[SI(x,y)] = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}$$
where $SI$ (Specific mutual Information) is the pointwise mutual information.
A basic property of the mutual information is that
$$I(X;Y) = H(X) - H(X|Y)$$
That is, knowing Y, we can save an average of $I(X;Y)$ bits in encoding X compared to not knowing Y.
Mutual information is symmetric:
$$I(X;Y) = I(Y;X) = H(X) + H(Y) - H(X,Y)$$
Mutual information can be expressed as the average Kullback–Leibler divergence (information gain) of the posterior probability distribution of X given the value of Y to the prior distribution on X:
$$I(X;Y) = \mathbb{E}_{p(y)}\left[D_{\mathrm{KL}}\big(p(X|Y=y) \,\|\, p(X)\big)\right]$$
In other words, this is a measure of how much, on the average, the probability distribution on X will change if we are given the value of Y.
This is often recalculated as the divergence from the product of the marginal distributions to the actual joint distribution:
$$I(X;Y) = D_{\mathrm{KL}}\big(p(X,Y) \,\|\, p(X)\,p(Y)\big)$$
Mutual information is closely related to the log-likelihood ratio test in the context of contingency tables and the multinomial distribution and to Pearson's χ² test: mutual information can be considered a statistic for assessing independence between a pair of variables, and has a well-specified asymptotic distribution.
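The properties of mutual information stated in this section can be checked numerically on a small example. The sketch below computes I(X;Y) from the definition and compares it with H(X) + H(Y) − H(X,Y); the two joint distributions and all helper names are illustrative assumptions.

```python
import math
from collections import defaultdict

def entropy(dist: dict) -> float:
    """Shannon entropy in bits of an {outcome: probability} mapping."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def marginals(joint: dict):
    """Marginal distributions of X and Y from a joint {(x, y): probability} mapping."""
    p_x, p_y = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        p_x[x] += p
        p_y[y] += p
    return dict(p_x), dict(p_y)

def mutual_information(joint: dict) -> float:
    """I(X;Y) in bits: sum of p(x,y) log2( p(x,y) / (p(x) p(y)) )."""
    p_x, p_y = marginals(joint)
    return sum(p * math.log2(p / (p_x[x] * p_y[y]))
               for (x, y), p in joint.items() if p > 0)

# Hypothetical joint distributions: one dependent pair, one independent pair.
dependent = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
independent = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}

mi = mutual_information(dependent)
p_x, p_y = marginals(dependent)
via_entropies = entropy(p_x) + entropy(p_y) - entropy(dependent)

# Swapping the roles of X and Y must not change the result.
swapped = {(y, x): p for (x, y), p in dependent.items()}
mi_swapped = mutual_information(swapped)
mi_independent = mutual_information(independent)
```

For the independent pair the joint distribution equals the product of its marginals, so every logarithm is zero and the mutual information vanishes.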
Kullback–Leibler divergence (information gain)
The Kullback–Leibler divergence (or information divergence, information gain, or relative entropy) is a way of comparing two distributions: a "true" probability distribution p(X), and an arbitrary probability distribution q(X).
If we compress data in a manner that assumes q(X) is the distribution underlying some data, when, in reality, p(X) is the correct distribution, the Kullback–Leibler divergence is the number of average additional bits per datum necessary for compression.
It is thus defined
$$D_{\mathrm{KL}}\big(p(X) \,\|\, q(X)\big) = \sum_{x \in X} p(x) \log \frac{p(x)}{q(x)}$$
Although it is sometimes used as a 'distance metric', it is not a true metric since it is not symmetric and does not satisfy the triangle inequality (making it a semi-quasimetric).
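A minimal sketch of the Kullback–Leibler divergence in bits, which also shows the asymmetry noted above. The function name and the two example distributions are assumptions; returning infinity when q assigns zero probability to an outcome that p allows is the usual convention and is adopted here.

```python
import math

def kl_divergence(p, q) -> float:
    """D_KL(p || q) in bits for two distributions over the same outcomes."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0:
            continue          # 0 * log(0 / q) is taken to be 0
        if q_i == 0:
            return math.inf   # p allows an outcome that q rules out
        total += p_i * math.log2(p_i / q_i)
    return total

# Hypothetical "true" distribution p and modelling distribution q.
p = [0.5, 0.5]
q = [0.9, 0.1]

d_pq = kl_divergence(p, q)  # extra bits per symbol when coding p with a code built for q
d_qp = kl_divergence(q, p)  # the reverse direction gives a different value
```

The divergence is zero only when the two distributions coincide, and the two directions generally disagree, which is why it is not a true metric.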
Other quantities
Other important information theoretic quantities include Rényi entropy (a generalization of entropy) and differential entropy (a generalization of quantities of information to continuous distributions).
Coding theory
Coding theory is one of the most important and direct applications of information theory.
It can be subdivided into source coding theory and channel coding theory.
Using a statistical description for data, information theory quantifies the number of bits needed to describe the data, which is the information entropy of the source.
Data compression (source coding): There are two formulations for the compression problem:
lossless data compression: the data must be reconstructed exactly;
lossy data compression: allocates bits needed to reconstruct the data, within a specified fidelity level measured by a distortion function.
This subset of Information theory is called rate–distortion theory.
Error-correcting codes (channel coding): While data compression removes as much redundancy as possible, an error correcting code adds just the right kind of redundancy (i.e. error correction) needed to transmit the data efficiently and faithfully across a noisy channel.
This division of coding theory into compression and transmission is justified by the information transmission theorems, or source–channel separation theorems that justify the use of bits as the universal currency for information in many contexts.
However, these theorems only hold in the situation where one transmitting user wishes to communicate to one receiving user.
In scenarios with more than one transmitter (the multiple-access channel), more than one receiver (the broadcast channel) or intermediary "helpers" (the relay channel), or more general networks, compression followed by transmission may no longer be optimal.
Network information theory refers to these multi-agent communication models.
Source theory
Any process that generates successive messages can be considered a source of information.
A memoryless source is one in which each message is an independent identically-distributed random variable, whereas the properties of ergodicity and stationarity impose more general constraints.
All such sources are stochastic.
These terms are well studied in their own right outside information theory.
Rate
Information rate is the average entropy per symbol.
For memoryless sources, this is merely the entropy of each symbol, while, in the case of a stationary stochastic process, it is
$$r = \lim_{n \to \infty} H(X_n \mid X_{n-1}, X_{n-2}, X_{n-3}, \ldots)$$
that is, the conditional entropy of a symbol given all the previous symbols generated.
For the more general case of a process that is not necessarily stationary, the average rate is
$$r = \lim_{n \to \infty} \frac{1}{n} H(X_1, X_2, \ldots, X_n)$$
that is, the limit of the joint entropy per symbol.
For stationary sources, these two expressions give the same result.
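As a worked illustration, consider a stationary two-state Markov source, which is an assumption introduced here and not an example from the text. For such a source the conditional entropy of a symbol given the whole past depends only on the previous symbol, so the rate is the stationary average of the per-state transition entropies. The sketch also evaluates the joint entropy per symbol for a long block to show both expressions approaching the same value.

```python
import math

def h_b(p: float) -> float:
    """Binary entropy function in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def entropy_rate(a: float, b: float) -> float:
    """Rate of a two-state Markov source with a = P(0 -> 1), b = P(1 -> 0), a + b > 0."""
    pi0 = b / (a + b)  # stationary probability of state 0
    pi1 = a / (a + b)  # stationary probability of state 1
    return pi0 * h_b(a) + pi1 * h_b(b)

def joint_entropy_per_symbol(a: float, b: float, n: int) -> float:
    """H(X_1, ..., X_n) / n for the stationary chain, using the chain rule:
    H(X_1) plus (n - 1) conditional entropies H(X_k | X_{k-1})."""
    pi0 = b / (a + b)
    return (h_b(pi0) + (n - 1) * entropy_rate(a, b)) / n

rate = entropy_rate(0.1, 0.3)
per_symbol = joint_entropy_per_symbol(0.1, 0.3, 10_000)
```

With a = b = 0.5 the source is memoryless and fair, and the rate is one bit per symbol; stickier chains have lower rates and are correspondingly more compressible.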
It is common in information theory to speak of the "rate" or "entropy" of a language.
This is appropriate, for example, when the source of information is English prose.
The rate of a source of information is related to its redundancy and how well it can be compressed, the subject of source coding.
Channel capacity
Communication over a channel, such as an Ethernet cable, is the primary motivation of information theory.
As anyone who's ever used a telephone (mobile or landline) knows, however, such channels often fail to produce exact reconstruction of a signal; noise, periods of silence, and other forms of signal corruption often degrade quality.
How much information can one hope to communicate over a noisy (or otherwise imperfect) channel?
Consider the communications process over a discrete channel.
In a simple model of the process, a transmitter sends a message $x$ through a noisy channel, and a receiver observes the output $y$.
Here X represents the space of messages transmitted, and Y the space of messages received during a unit time over our channel.
Let $p(y|x)$ be the conditional probability distribution function of Y given X.
We will consider $p(y|x)$ to be an inherent fixed property of our communications channel (representing the nature of the noise of our channel).
Then the joint distribution of X and Y is completely determined by our channel and by our choice of $f(x)$, the marginal distribution of messages we choose to send over the channel.
Under these constraints, we would like to maximize the rate of information, or the signal, we can communicate over the channel.
The appropriate measure for this is the mutual information, and this maximum mutual information is called the channel capacity and is given by:
$$C = \max_{f} I(X;Y)$$
This capacity has the following property related to communicating at information rate R (where R is usually bits per symbol).
For any information rate R < C and coding error ε > 0, for large enough N, there exists a code of length N and rate ≥ R and a decoding algorithm, such that the maximal probability of block error is ≤ ε; that is, it is always possible to transmit with arbitrarily small block error.
In addition, for any rate R > C, it is impossible to transmit with arbitrarily small block error.
Channel coding is concerned with finding such nearly optimal codes that can be used to transmit data over a noisy channel with a small coding error at a rate near the channel capacity.
Channel capacity of particular model channels
A continuous-time analog communications channel subject to Gaussian noise — see Shannon–Hartley theorem.
A binary symmetric channel (BSC) with crossover probability p is a binary input, binary output channel that flips the input bit with probability p.
The BSC has a capacity of $1 - H_b(p)$ bits per channel use, where $H_b$ is the binary entropy function to the base-2 logarithm: $H_b(p) = -p \log_2 p - (1 - p) \log_2 (1 - p)$.
A binary erasure channel (BEC) with erasure probability p is a binary input, ternary output channel.
The possible channel outputs are 0, 1, and a third symbol 'e' called an erasure.
The erasure represents complete loss of information about an input bit.
The capacity of the BEC is 1 - p bits per channel use.
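The closed-form capacities of the binary symmetric and binary erasure channels can be compared against the definition of capacity as the maximum mutual information over input distributions. The sketch below does this with a simple grid search over binary input distributions; the grid search, the matrix representation `channel[x][y] = p(y|x)`, and the function names are assumptions chosen for illustration, not a general capacity algorithm.

```python
import math

def h_b(p: float) -> float:
    """Binary entropy function in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(p: float) -> float:
    return 1 - h_b(p)

def bec_capacity(p: float) -> float:
    return 1 - p

def mutual_information(input_dist, channel) -> float:
    """I(X;Y) in bits, where channel[x][y] = p(y|x)."""
    n_out = len(channel[0])
    p_y = [sum(input_dist[x] * channel[x][y] for x in range(len(input_dist)))
           for y in range(n_out)]
    mi = 0.0
    for x, p_x in enumerate(input_dist):
        for y in range(n_out):
            p_xy = p_x * channel[x][y]
            if p_xy > 0:
                mi += p_xy * math.log2(channel[x][y] / p_y[y])
    return mi

def capacity_by_grid_search(channel, steps: int = 1000) -> float:
    """Maximize I(X;Y) over binary input distributions (q, 1 - q) on a grid."""
    return max(mutual_information([q / steps, 1 - q / steps], channel)
               for q in range(steps + 1))

bsc = [[0.9, 0.1], [0.1, 0.9]]               # crossover probability 0.1
bec = [[0.75, 0.0, 0.25], [0.0, 0.75, 0.25]] # outputs 0, 1, erasure; p = 0.25

bsc_numeric = capacity_by_grid_search(bsc)
bec_numeric = capacity_by_grid_search(bec)
```

For both channels the maximum is attained at the uniform input distribution, and the numerical values agree with the closed-form expressions.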
Applications to other fields
Intelligence uses and secrecy applications
Information theoretic concepts apply to cryptography and cryptanalysis.
Turing's information unit, the ban, was used in the Ultra project, breaking the German Enigma machine code and hastening the end of WWII in Europe.
Shannon himself defined an important concept now called the unicity distance.
Based on the redundancy of the plaintext, it attempts to give a minimum amount of ciphertext necessary to ensure unique decipherability.
Information theory leads us to believe it is much more difficult to keep secrets than it might first appear.
A brute force attack can break systems based on asymmetric key algorithms or on most commonly used methods of symmetric key algorithms (sometimes called secret key algorithms), such as block ciphers.
The security of all such methods currently comes from the assumption that no known attack can break them in a practical amount of time.
Information theoretic security refers to methods such as the one-time pad that are not vulnerable to such brute force attacks.
In such cases, the positive conditional mutual information between the plaintext and ciphertext (conditioned on the key) can ensure proper transmission, while the unconditional mutual information between the plaintext and ciphertext remains zero, resulting in absolutely secure communications.
In other words, an eavesdropper would not be able to improve his or her guess of the plaintext by gaining knowledge of the ciphertext but not of the key.
However, as in any other cryptographic system, care must be used to correctly apply even information-theoretically secure methods; the Venona project was able to crack the one-time pads of the Soviet Union due to their improper reuse of key material.
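The one-time pad and the danger of key reuse can both be shown in a few lines. This is a sketch for illustration only, not production cryptography: the messages are made up, and the key is drawn with Python's standard `secrets.token_bytes`.

```python
import secrets

def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))

# Correct use: a fresh random key as long as the message, used once.
message = b"ATTACK AT DAWN"
key = secrets.token_bytes(len(message))
ciphertext = xor_bytes(message, key)
recovered = xor_bytes(ciphertext, key)

# Improper reuse: a second message encrypted under the same key.
other_message = b"RETREAT AT TEN"
other_ciphertext = xor_bytes(other_message, key)

# The key cancels out, so an eavesdropper holding both ciphertexts learns
# the XOR of the two plaintexts without ever seeing the key.
leak = xor_bytes(ciphertext, other_ciphertext)
```

With a single use, every plaintext of the same length is equally consistent with the ciphertext; after reuse, the leaked XOR of the plaintexts is enough for classical cryptanalysis to begin.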
Pseudorandom number generation
Pseudorandom number generators are widely available in computer language libraries and application programs.
They are, almost universally, unsuited to cryptographic use as they do not evade the deterministic nature of modern computer equipment and software.
A class of improved random number generators is termed cryptographically secure pseudorandom number generators, but even they require random seeds external to the software to work as intended.
These can be obtained via extractors, if done carefully.
The measure of sufficient randomness in extractors is min-entropy, a value related to Shannon entropy through Rényi entropy; Rényi entropy is also used in evaluating randomness in cryptographic systems.
Although related, the distinctions among these measures mean that a random variable with high Shannon entropy is not necessarily satisfactory for use in an extractor and so for cryptography uses.
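The gap between the two measures can be made concrete with a small example. The sketch below uses the standard definition of min-entropy as the negative base-2 logarithm of the most likely outcome's probability; the distribution itself is a hypothetical one chosen to make the contrast obvious.

```python
import math

def shannon_entropy(probs) -> float:
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def min_entropy(probs) -> float:
    """Min-entropy in bits: determined entirely by the most likely outcome."""
    return -math.log2(max(probs))

# Hypothetical source: one outcome occurs half the time, and the other
# half of the probability is spread evenly over 1024 further outcomes.
skewed = [0.5] + [0.5 / 1024] * 1024

h_shannon = shannon_entropy(skewed)
h_min = min_entropy(skewed)
```

The source averages six bits of Shannon entropy, yet an attacker who always guesses the most likely outcome succeeds half the time, which is what the one bit of min-entropy expresses.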
Miscellaneous applications
Information theory also has applications in gambling and investing, black holes, bioinformatics, and music.
Italian language
Italian (italiano, or lingua italiana) is a Romance language spoken as a first language by about 63 million people, primarily in Italy.
In Switzerland, Italian is one of four official languages.
It is also the official language of San Marino.
It is the primary language of the Vatican City.
Standard Italian, adopted by the state after the unification of Italy, is based on Tuscan and is somewhat intermediate between Italo-Dalmatian languages of the South and Northern Italian dialects of the North.
Unlike most other Romance languages, Italian has retained the contrast between short and long consonants which existed in Latin.
As in most Romance languages, stress is distinctive.
Of the Romance languages, Italian is considered to be one of the closest resembling Latin in terms of vocabulary.
According to Ethnologue, lexical similarity is 89% with French, 87% with Catalan, 85% with Sardinian, 82% with Spanish, 78% with Rheto-Romance, and 77% with Romanian.
It is affectionately called il parlar gentile (the gentle language) by its speakers.
Writing system
Italian is written using the Latin alphabet.
The letters J, K, W, X and Y are not considered part of the standard Italian alphabet, but appear in loanwords (such as jeans, whisky, taxi).
X has become a commonly used letter in genuine Italian words with the prefix extra-.
J in Italian is an old-fashioned orthographic variant of I, appearing in the first name "Jacopo" as well as in some Italian place names, e.g., the towns of Bajardo, Bojano, Joppolo, Jesolo, Jesi, among numerous others, and in the alternate spelling Mar Jonio (also spelled Mar Ionio) for the Ionian Sea.
J may also appear in many words from different dialects, but its use is discouraged in contemporary Italian, and it is not part of the standard 21-letter contemporary Italian alphabet.
Each of these foreign letters had an Italian equivalent spelling: gi for j, c or ch for k, u or v for w (depending on what sound it makes), s, ss, or cs for x, and i for y.
Italian uses the acute accent over the letter E (as in perché, why/because) to indicate a front mid-close vowel, and the grave accent (as in tè, tea) to indicate a front mid-open vowel.
The grave accent is also used on letters A, I, O, and U to mark stress when it falls on the final vowel of a word (for instance gioventù, youth).
Typically, the penultimate syllable is stressed.
If syllables other than the last one are stressed, the accent is not mandatory, unlike in Spanish, and, in virtually all cases, it is omitted.
In some cases, when the word is ambiguous (as principi), the accent mark is sometimes used in order to disambiguate its meaning (in this case, prìncipi, princes, or princìpi, principles).
This is, however, not compulsory.
Rare words with three or more syllables can confuse Italians themselves, and the pronunciation of Istanbul is a common example of a word in which placement of stress is not clearly established.
Turkish, like French, tends to stress the final syllable, but Italian does not, so both "Istànbul" and "Ìstanbul" are heard.
Another instance is the American State of Florida: the correct way to pronounce it in Italian is like in Spanish, "Florìda", but since there is an Italian word meaning the same ("flourishing"), "flòrida", and because of the influence of English, most Italians pronounce it that way.
Dictionaries give the latter as an alternative pronunciation.
The letter H at the beginning of a word is used to distinguish ho, hai, ha, hanno (present indicative of avere, 'to have') from o ('or'), ai ('to the'), a ('to'), anno ('year').
In the spoken language this letter is always silent for the cases given above.
H is also used in combinations with other letters (see below), but no phoneme /h/ exists in Italian.
In foreign words that have entered common use, such as "hotel" or "hovercraft", the H is commonly silent.
The letter Z represents /dz/, for example: Zanzara (mosquito), or /ts/, for example: Nazione (nation), depending on context, though there are few minimal pairs.
The same goes for S, which can represent /s/ or /z/.
However, these two phonemes are in complementary distribution everywhere except between two vowels in the same word, and even in such environment there are extremely few minimal pairs, so that this distinction is being lost in many varieties.
The letters C and G represent affricates, /tʃ/ as in "chair" and /dʒ/ as in "gem" respectively, before the front vowels I and E.
They are pronounced as plosives /k/, /ɡ/ (as in "call" and "gall") otherwise.
Front/back vowel rules for C and G are similar in French, Romanian, Spanish, and to some extent English (including Old English).
Swedish and Norwegian have similar rules for K and G.
(See also palatalization.)
However, an H can be added between C or G and E or I to represent a plosive, and an I can be added between C or G and A, O or U to signal that the consonant is an affricate.
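For example, chi ("who") begins with the plosive /k/ and ghiaccio ("ice") with the plosive /ɡ/, whereas ciao begins with the affricate /tʃ/ and giallo ("yellow") with the affricate /dʒ/.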
Note that the H is silent in the digraphs CH and GH, as also the I in cia, cio, ciu and even cie is not pronounced as a separate vowel, unless it carries the primary stress.
For example, it is silent in ciao and cielo, but it is pronounced in farmacia and farmacie.
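The spelling rule for C and G described above can be sketched as a small function. This is a simplified illustration, not a full grapheme-to-phoneme converter: the name `c_or_g_quality` is hypothetical, the function looks only at the letter that follows (skipping a doubled CC or GG), and it deliberately ignores the special digraphs GN, GL and SC as well as loanwords.

```python
# Front vowels, including accented forms that can follow C or G.
FRONT_VOWELS = set("eièéìí")

def c_or_g_quality(word, index):
    """Return 'affricate' or 'plosive' for the C or G at word[index].

    Rule of thumb: C and G are affricates before the front vowels
    E and I, and plosives otherwise (including before H).
    """
    w = word.lower()
    letter = w[index]
    if letter not in ("c", "g"):
        raise ValueError("expected the letter C or G at this position")
    j = index + 1
    # A doubled consonant (cc, gg) shares the quality of its second letter.
    if j < len(w) and w[j] == letter:
        j += 1
    following = w[j] if j < len(w) else ""
    return "affricate" if following in FRONT_VOWELS else "plosive"

for word in ("casa", "chi", "ciao", "gelato", "ghiaccio", "giallo"):
    print(word, "->", c_or_g_quality(word, 0))
```

Because an inserted H is not a front vowel, chi and ghiaccio come out as plosives, while the inserted I in ciao and giallo triggers the affricate reading, which mirrors the orthographic device the text describes.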
There are three other special digraphs in Italian: GN, GL and SC.
GN represents the palatal nasal /ɲ/.
GL represents the palatal lateral approximant /ʎ/ only before i, and never at the beginning of a word, except in the personal pronoun and definite article gli.
(Compare with Spanish ñ and ll, Portuguese nh and lh.)
SC represents the fricative /ʃ/ before i or e.
Except in the speech of some Northern Italians, all of these are normally geminate between vowels.
In general, all letters or digraphs represent phonemes rather clearly, and, in standard varieties of Italian, there is little allophonic variation.
The most notable exceptions are assimilation of /n/ in point of articulation before consonants, assimilatory voicing of /s/ to following voiced consonants, and vowel length (vowels are long in stressed open syllables, and short elsewhere) — compare with the enormous number of allophones of the English phoneme /t/.
Spelling is clearly phonemic and difficult to mistake given a clear pronunciation.
Exceptions are generally only found in foreign borrowings.
There are fewer cases of dyslexia than among speakers of languages such as English, and the concept of a spelling bee is strange to Italians.
History
The history of the Italian language is long, but the modern standard of the language was largely shaped by relatively recent events.
The earliest surviving texts which can definitely be called Italian (or more accurately, vernacular, as opposed to its predecessor Vulgar Latin) are legal formulae from the region of Benevento dating from 960-963.
What would come to be thought of as Italian was first formalized in the first years of the 14th century through the works of Dante Alighieri, who mixed southern Italian languages, especially Sicilian, with his native Tuscan in his epic poems known collectively as the Commedia, to which Giovanni Boccaccio later affixed the title Divina.
Dante's much-loved works were read throughout Italy and his written dialect became the "canonical standard" that all educated Italians could understand.
Dante is still credited with standardizing the Italian language and, thus, the dialect of Tuscany became the basis for what would become the official language of Italy.
Italy has always had a distinctive dialect for each city since the cities were until recently thought of as city-states.
The latter now has considerable variety, however.
As Tuscan-derived Italian came to be used throughout the nation, features of local speech were naturally adopted, producing various versions of Regional Italian.
The most characteristic differences, for instance, between Roman Italian and Milanese Italian are the gemination of initial consonants and the pronunciation of stressed "e", and of "s" in some cases (e.g. va bene "all right" is pronounced [va ˈbːɛne] by a Roman, [va ˈbene] by a Milanese; a casa "at home": Roman [a ˈkːasa], Milanese [a ˈkaza]).
In contrast to the dialects of northern Italy, southern Italian dialects were largely untouched by the Franco-Occitan influences introduced to Italy, mainly by bards from France, during the Middle Ages.
Even in the case of Northern Italian dialects, however, scholars are careful not to overstate the effects of outsiders on the natural indigenous developments of the languages.
(See La Spezia-Rimini Line.)
The economic might and relative advanced development of Tuscany at the time (Late Middle Ages), gave its dialect weight, though Venetian remained widespread in medieval Italian commercial life.
Also, the increasing cultural relevance of Florence during the periods of Umanesimo (Humanism) and the Rinascimento (Renaissance) made its volgare (dialect), or rather a refined version of it, a standard in the arts.
The re-discovery of Dante's De vulgari eloquentia and a renewed interest in linguistics in the 16th century sparked a debate which raged throughout Italy concerning which criteria should be chosen to establish a modern Italian standard to be used as much as a literary as a spoken language.
Scholars were divided into three factions: the purists, headed by Pietro Bembo, who in his Gli Asolani claimed that the language might be based only on the great literary classics (notably Petrarch and Boccaccio, but not Dante, as Bembo believed that the Divine Comedy was not dignified enough because it used elements from other dialects); Niccolò Machiavelli and other Florentines, who preferred the version spoken by ordinary people in their own times; and the courtiers, such as Baldassarre Castiglione and Gian Giorgio Trissino, who insisted that each local vernacular must contribute to the new standard.
Eventually Bembo's ideas prevailed, the result being the publication of the first Italian dictionary in 1612 and the foundation of the Accademia della Crusca in Florence (1582-3), the official legislative body of the Italian language.
Italian literature's first modern novel, I Promessi Sposi (The Betrothed), by Alessandro Manzoni, further defined the standard by "rinsing" his Milanese "in the waters of the Arno" (Florence's river), as he states in the Preface to his 1840 edition.
After unification a huge number of civil servants and soldiers recruited from all over the country introduced many more words and idioms from their home dialects ("ciao" is Venetian, "panettone" is Milanese etc.).
Classification
Italian is most closely related to the other two Italo-Dalmatian languages, Sicilian and the extinct Dalmatian.
The three are part of the Italo-Western grouping of the Romance languages, which are a subgroup of the Italic branch of Indo-European.
Geographic distribution
Between 60 and 70 million people speak Italian as their mother tongue.
Those who use Italian as a second or cultural language are estimated at around 110-120 million.
Italian is the official language of Italy and San Marino, and one of the official languages of Switzerland, spoken mainly in Ticino and Grigioni cantons, a region referred to as Italian Switzerland.
It is also the second official language in some areas of Istria, in Slovenia and Croatia, where an Italian minority exists.
It is the primary language of the Vatican City and is widely used and taught in Monaco and Malta.
It is also widely understood in France with over one million speakers (especially in Corsica and the County of Nice, areas that historically spoke Italian dialects before annexation to France), and in Albania.
Italian is also spoken by some in former Italian colonies in Africa (Libya, Somalia and Eritrea).
However, its use has sharply dropped off since the colonial period.
In Eritrea Italian is widely understood.
In fact, for fifty years, during the colonial period, Italian was the language of instruction, but as of 1997, there is only one Italian language school remaining, with 470 pupils.
In Somalia Italian used to be a major language but due to the civil war and lack of education only the older generation still uses it.
Italian and Italian dialects are widely used by Italian immigrants and many of their descendants (see Italians) living throughout Western Europe (especially France, Germany, Belgium, Switzerland, the United Kingdom and Luxembourg), the United States, Canada, Australia, and Latin America (especially Uruguay, Brazil, Argentina, and Venezuela).
In the United States, Italian speakers are most commonly found in four cities: Boston (7,000), Chicago (12,000), New York City (140,000), and Philadelphia (15,000).
In Canada there are large Italian-speaking communities in Montreal (120,000) and Toronto (195,000).
Italian is the second most commonly-spoken language in Australia, where 353,605 Italian Australians, or 1.9% of the population, reported speaking Italian at home in the 2001 Census.
In 2001 there were 130,000 Italian speakers in Melbourne, and 90,000 in Sydney.
Italian language education
Italian is widely taught in many schools around the world, but rarely as the first non-native language of pupils; in fact, Italian generally is the fourth or fifth most taught second-language in the world.
In anglophone parts of Canada, Italian is, after French, the third most taught language.
In francophone Canada it is third after English.
In the United States and the United Kingdom, Italian ranks fourth (after Spanish-French-German and French-German-Spanish respectively).
Throughout the world, Italian is the fifth most taught non-native language, after English, French, Spanish, and German.
In the European Union, Italian is spoken as a mother tongue by 13% of the population (64 million, mainly in Italy itself) and as a second language by 3% (14 million); among EU member states, it is most likely to be desired (and therefore learned) as a second language in Malta (61%), Croatia (14%), Slovenia (12%), Austria (11%), Romania (8%), France (6%), and Greece (6%).
It is also an important second language in Albania and Switzerland, which are not EU members or candidates.
Influence and derived languages
From the late 19th to the mid 20th century, thousands of Italians settled in Argentina, Uruguay and southern Brazil, where they formed a very strong physical and cultural presence (see the Italian diaspora).
In some cases, colonies were established where variants of Italian dialects were used, and some continue to use a derived dialect.
Examples are Rio Grande do Sul, Brazil, where Talian is used, and the town of Chipilo near Puebla, Mexico; each continues to use a derived form of Venetian dating back to the 19th century.
Another example is Cocoliche, an Italian-Spanish pidgin once spoken in Argentina and especially in Buenos Aires, and Lunfardo.
Rioplatense Spanish, and particularly the speech of the city of Buenos Aires, has intonation patterns that resemble those of Italian dialects, due to the fact that Argentina had a constant, large influx of Italian settlers since the second half of the nineteenth century; initially primarily from Northern Italy then, since the beginning of the twentieth century, mostly from Southern Italy.
Lingua Franca
Starting in late medieval times, Italian language variants (especially the Tuscan and Venetian variants) replaced Latin to become the primary commercial language for much of Europe and the Mediterranean.
This became solidified during the Renaissance with the strength of Italian banking and the rise of humanism in the arts.
During the period of the Renaissance, Italy held artistic sway over the rest of Europe.
All educated European gentlemen were expected to make the Grand Tour, visiting Italy to see its great historical monuments and works of art.
It thus became expected that educated Europeans would learn at least some Italian; the English poet John Milton, for instance, wrote some of his early poetry in Italian.
In England, Italian became the second most common modern language to be learned, after French (though the classical languages, Latin and Greek, came first).
However, by the late eighteenth century, Italian tended to be replaced by German as the second modern language on the curriculum.
Yet Italian loanwords continue to be used in most other European languages in matters of art and music.
Today, the Italian language continues to be used as a lingua franca in some environments.
Within the Catholic church Italian is known by a large part of the ecclesiastic hierarchy, and is used in substitution of Latin in some official documents.
The presence of Italian as the primary language in the Vatican City indicates not only use within the Holy See, but also throughout the world where an episcopal seat is present.
It continues to be used in music and opera.
Other areas where Italian is sometimes used as a means of communication are some sports (sometimes football and motorsports) and the design and fashion industries.
Dialects
In Italy, all Romance languages spoken as the vernacular, other than standard Italian and other unrelated, non-Italian languages, are termed "Italian dialects".
Many Italian dialects are, in fact, historical languages in their own right.
These include recognized language groups such as Friulian, Neapolitan, Sardinian, Sicilian, Venetian, and others, and regional variants of these languages such as Calabrian.
The division between dialect and language has been used by scholars (such as by Francesco Bruni) to distinguish between the languages that made up the Italian koine, and those which had very little or no part in it, such as Albanian, Greek, German, Ladin, and Occitan, which are still spoken by minorities.
Dialects are generally not used for general mass communication and are usually limited to native speakers in informal contexts.
In the past, speaking in dialect was often deprecated as a sign of poor education.
Younger generations, especially those under 35 (though it may vary in different areas), speak almost exclusively standard Italian in all situations, usually with local accents and idioms.
Regional differences can be recognized by various factors: the openness of vowels, the length of the consonants, and influence of the local dialect (for example, annà replaces andare in the area of Rome for the infinitive "to go").
Sounds
Vowels
Italian has seven vowel phonemes: /a/, /e/, /ɛ/, /i/, /o/, /ɔ/, /u/.
The pairs /e/-/ɛ/ and /o/-/ɔ/ are seldom distinguished in writing and often confused, even though most varieties of Italian employ both phonemes consistently.
Compare, for example, "perché" [perˈkɛ] (why, because) and "senti" [ˈsenti] (you listen, you are listening, listen!), employed by some northern speakers, with [perˈke] and [ˈsɛnti], as pronounced by most central and southern speakers.
As a result, the usage is strongly indicative of a person's origin.
The standard (Tuscan) usage of these vowels is listed in vocabularies, and employed outside Tuscany mainly by specialists, especially actors and very few (television) journalists.
These are truly different phonemes, however: compare /ˈpeska/ (fishing) and /ˈpɛska/ (peach), both spelled pesca.
Similarly /ˈbotte/ ('barrel') and /ˈbɔtte/ ('beatings'), both spelled botte, discriminate /o/ and /ɔ/.
In general, each vowel in a vowel combination is pronounced separately.
Diphthongs exist (e.g. uo, iu, ie, ai), but are limited to an unstressed u or i before or after a stressed vowel.
The unstressed u in a diphthong approximates the English semivowel w, the unstressed i approximates the semivowel y.
E.g.: buono /ˈbwɔno/, ieri /ˈjɛri/.
Triphthongs exist in Italian as well, like "continuiamo" ("we continue").
Three-vowel combinations exist only in the form of a semiconsonant (/j/ or /w/) followed by a vowel, followed by a desinence vowel (usually /i/), as in miei, suoi, or of two semiconsonants followed by a vowel, as in the group -uia- exemplified above, or -iuo- in the word aiuola.
Mobile diphthongs
Many Latin words with a short e or o have Italian counterparts with a mobile diphthong (ie and uo respectively).
When the vowel sound is stressed, it is pronounced and written as a diphthong; when not stressed, it is pronounced and written as a single vowel.
So Latin focus gave rise to Italian fuoco (meaning both "fire" and "optical focus"): when unstressed, as in focale ("focal") the "o" remains alone.
Latin pes (more precisely its accusative form pedem) is the source of Italian piede (foot): but unstressed "e" was left unchanged in pedone (pedestrian) and pedale (pedal).
From Latin iocus comes Italian giuoco ("play", "game"), though in this case gioco is more common: giocare means "to play (a game)".
From Latin homo comes Italian uomo (man), but also umano (human) and ominide (hominid).
From Latin ovum comes Italian uovo (egg) and ovaie (ovaries).
(The same phenomenon occurs in Spanish: juego (play, game) and jugar (to play), nieve (snow) and nevar (to snow)).
Consonants
Two symbols in a table cell denote the voiceless and voiced consonant, respectively.
Nasals undergo assimilation when followed by a consonant, e.g., when preceding a velar (/k/ or /ɡ/) only [ŋ] appears, etc.
Italian has geminate, or double, consonants, which are distinguished by length.
Length is distinctive for all consonants except for /ʃ/, /ts/, /dz/, /ʎ/, /ɲ/, which are always geminate, and /z/, which is always single.
Geminate plosives and affricates are realised as lengthened closures.
Geminate fricatives, nasals, and /l/ are realized as lengthened continuants.
The flap consonant is typically dialectal, and it is called erre moscia.
The correct standard pronunciation is the alveolar trill [r].
Of special interest to the linguistic study of Italian is the Gorgia Toscana, or "Tuscan Throat", the weakening or lenition of certain intervocalic consonants in Tuscan dialects.
See also Syntactic doubling.
Assimilation
Italian has few diphthongs, so most unfamiliar diphthongs that are heard in foreign words (in particular, those beginning with vowel "a", "e", or "o") will be assimilated as the corresponding diaeresis (i.e., the vowel sounds will be pronounced separately).
Italian phonotactics do not usually permit polysyllabic nouns and verbs to end with consonants, excepting poetry and song, so foreign words may receive extra terminal vowel sounds.
Grammar
Common variations in the writing systems
Some variations in the usage of the writing system may be present in practical use.
These are scorned by educated people, but they are so common in certain contexts that knowledge of them may be useful.
Usage of x instead of per: this is very common among teenagers and in SMS abbreviations.
The multiplication operator is pronounced "per" in Italian, and so it is sometimes used to replace the word "per", which means "for"; thus, for example, "per te" ("for you") is shortened to "x te" (compare with English "4 U").
Words containing per can also have it replaced with x: for example, perché (both "why" and "because") is often shortened as xché or xké or x' (see below).
This usage might be useful to jot down quick notes or to fit more text into the low character limit of an SMS, but it is considered unacceptable in formal writing.
Usage of foreign letters such as k, j and y, especially in nicknames and SMS language: ke instead of che, Giusy instead of Giuseppina (or sometimes Giuseppe).
This is curiously mirrored in the usage of i in English names such as Staci instead of Stacey, or in the usage of c in Northern Europe (Jacob instead of Jakob).
The use of "k" instead of "ch" or "c" to represent a plosive sound is documented in some historical texts from before the standardization of the Italian language; however, that usage is no longer standard in Italian.
Possibly because it is associated with the German language, the letter "k" has sometimes also been used in satire to suggest that a political figure is an authoritarian or even a "pseudo-nazi": Francesco Cossiga was famously nicknamed Kossiga by rioting students during his tenure as minister of internal affairs.
[Cf. the politicized spelling Amerika in the USA.]
Usage of the following abbreviations is limited to the electronic communications media and is deprecated in all other cases: nn instead of non (not), cmq instead of comunque (anyway, however), cm instead of come (how, like, as), d instead of di (of), (io/loro) sn instead of (io/loro) sono (I am/they are), (io) dv instead of (io) devo (I must/I have to) or instead of dove (where), (tu) 6 instead of (tu) sei (you are).
Inexperienced typists often replace accents with apostrophes, such as in perche' instead of perché.
Uppercase È is particularly rare, as it is absent from the Italian keyboard layout, and is very often written as E (even though there are several ways of producing the uppercase È on a computer).
This never happens in books or other professionally typeset material.
Samples
Examples
Cheers: "Salute!"
English: inglese
Good-bye: arrivederci
Hello: ciao
Good day: buon giorno
Good evening: buona sera
Yes: sì
No: no
How are you?: Come stai (informal); Come sta (formal)
Sorry: mi dispiace
Excuse me: scusa (informal); scusi (formal)
Again: di nuovo; ancora
Always: sempre
When: quando
Where: dove
Why/Because: perché
How: come
How much is it?: quanto costa?
Thank you!: grazie!
Bon appetit: buon appetito
You're welcome!: prego!
I love you: Ti amo, Ti voglio bene.
The difference is that "Ti amo" is used in a romantic relationship, and "Ti voglio bene" on any other occasion (to parents, relatives, friends...).
Counting to twenty:
One: uno
Two: due
Three: tre
Four: quattro
Five: cinque
Six: sei
Seven: sette
Eight: otto
Nine: nove
Ten: dieci
Eleven: undici
Twelve: dodici
Thirteen: tredici
Fourteen: quattordici
Fifteen: quindici
Sixteen: sedici
Seventeen: diciassette
Eighteen: diciotto
Nineteen: diciannove
Twenty: venti
The days of the week:
Monday: lunedì
Tuesday: martedì
Wednesday: mercoledì
Thursday: giovedì
Friday: venerdì
Saturday: sabato
Sunday: domenica
Sample texts
There is a recording of Dante's Divine Comedy read by Lino Pertile available at http://etcweb.princeton.edu/dante/pdp/
Japanese language
Japanese (日本語, Nihongo) is a language spoken by over 130 million people in Japan and in Japanese emigrant communities.
It is related to the Ryukyuan languages, but whatever relationships with other languages it may have remain undemonstrated.
It is an agglutinative language and is distinguished by a complex system of honorifics reflecting the hierarchical nature of Japanese society, with verb forms and particular vocabulary to indicate the relative status of speaker, listener and the third person mentioned in conversation whether he is there or not.
The sound inventory of Japanese is relatively small, and it has a lexically distinct pitch-accent system.
It is a mora-timed language.
The Japanese language is written with a combination of three different types of scripts: Chinese characters called kanji (漢字 / かんじ), and two syllabic scripts made up of modified Chinese characters, hiragana (平仮名 / ひらがな) and katakana (片仮名 / カタカナ).
The Latin alphabet, rōmaji (ローマ字), is also often used in modern Japanese, especially for company names and logos, advertising, and when entering Japanese text into a computer.
Western style Arabic numerals are generally used for numbers, but traditional Sino-Japanese numerals are also commonplace.
Japanese vocabulary has been heavily influenced by loanwords from other languages.
A vast number of words were borrowed from Chinese, or created from Chinese models, over a period of at least 1,500 years.
Since the late 19th century, Japanese has borrowed a considerable number of words from Indo-European languages, primarily English.
Because of the special trade relationship between Japan and first Portugal in the 16th century, and then mainly the Netherlands in the 17th century, Portuguese, German and Dutch have also been influential.
Geographic distribution
Although Japanese is spoken almost exclusively in Japan, it has been and sometimes still is spoken elsewhere.
When Japan occupied Korea, Taiwan, parts of the Chinese mainland, and various Pacific islands before and during World War II, locals in those countries were forced to learn Japanese in empire-building programs.
As a result, there are many people in these countries who can speak Japanese in addition to the local languages.
Japanese emigrant communities (the largest of which are to be found in Brazil) sometimes employ Japanese as their primary language.
Approximately 5% of Hawaii residents speak Japanese, with Japanese ancestry the largest single ancestry in the state (over 24% of the population).
Japanese emigrants can also be found in Peru, Argentina, Australia (especially Sydney, Brisbane, and Melbourne), the United States (notably California, where 1.2% of the population has Japanese ancestry, and Hawaii), and the Philippines (particularly in Davao and Laguna).
Their descendants, who are known as nikkei (日系, literally Japanese descendants), however, rarely speak Japanese fluently after the second generation.
There are estimated to be several million non-Japanese studying the language as well.
Official status
Japanese is the de facto official language of Japan.
There is a form of the language considered standard: hyōjungo (標準語), Standard Japanese, or kyōtsūgo (共通語), the common language.
The meanings of the two terms are almost the same.
Hyōjungo or kyōtsūgo is a conception that forms the counterpart of dialect.
This normative language was born after the Meiji Restoration (明治維新, meiji ishin, 1868) from the language spoken in uptown Tokyo, out of the need for communication.
Hyōjungo is taught in schools and used on television and in official communications, and is the version of Japanese discussed in this article.
Formerly, standard Japanese in writing (文語, bungo, "literary language") was different from colloquial language (口語, kōgo).
The two systems have different rules of grammar and some variance in vocabulary.
Bungo was the main method of writing Japanese until about 1900; since then kōgo gradually extended its influence and the two methods were both used in writing until the 1940s.
Bungo still has some relevance for historians, literary scholars, and lawyers (many Japanese laws that survived World War II are still written in bungo, although there are ongoing efforts to modernize their language).
Kōgo is the predominant method of both speaking and writing Japanese today, although bungo grammar and vocabulary are occasionally used in modern Japanese for effect.
Dialects
Dozens of dialects are spoken in Japan.
The profusion is due to many factors, including the length of time the archipelago has been inhabited, its mountainous island terrain, and Japan's long history of both external and internal isolation.
Dialects typically differ in terms of pitch accent, inflectional morphology, vocabulary, and particle usage.
Some even differ in vowel and consonant inventories, although this is uncommon.
The main distinction in Japanese accents is between Tokyo-type (東京式, Tōkyō-shiki) and Kyoto-Osaka-type (京阪式, Keihan-shiki), though Kyūshū-type dialects form a third, smaller group.
Within each type are several subdivisions.
Kyoto-Osaka-type dialects are in the central region, with borders roughly formed by Toyama, Kyōto, Hyōgo, and Mie Prefectures; most Shikoku dialects are also that type.
The final category of dialects consists of those descended from the Eastern dialect of Old Japanese; these dialects are spoken on Hachijō-jima and a few other islands.
Dialects from peripheral regions, such as Tōhoku or Tsushima, may be unintelligible to speakers from other parts of the country.
The several dialects of Kagoshima in southern Kyūshū are famous for being unintelligible not only to speakers of standard Japanese but to speakers of nearby dialects elsewhere in Kyūshū as well.
This is probably due in part to the Kagoshima dialects' peculiarities of pronunciation, which include the existence of closed syllables (i.e., syllables that end in a consonant, such as /kob/ or /koʔ/ for Standard Japanese /kumo/ "spider").
The Kansai dialect group is spoken and known by many Japanese, and the Osaka dialect in particular is associated with comedy (see Kansai dialect).
Dialects of Tōhoku and North Kantō are associated with typical farmers.
The Ryūkyūan languages, spoken in Okinawa and Amami Islands that are politically part of Kagoshima, are distinct enough to be considered a separate branch of the Japonic family.
However, many ordinary Japanese tend to consider the Ryūkyūan languages dialects of Japanese.
Not only is each language unintelligible to Japanese speakers, but most are unintelligible to those who speak other Ryūkyūan languages.
Recently, Standard Japanese has become prevalent nationwide (including the Ryūkyū islands) due to education, mass media, and increase of mobility networks within Japan, as well as economic integration.
Sounds
Japanese vowels are "pure" sounds.
The only unusual vowel is the high back vowel /ɯ/, which is like /u/, but compressed instead of rounded.
Japanese has five vowels, and vowel length is phonemic, so each one has both a short and a long version.
Some Japanese consonants have several allophones, which may give the impression of a larger inventory of sounds.
However, some of these allophones have since become phonemic.
For example, in the Japanese language up to and including the first half of the twentieth century, the phonemic sequence /ti/ was palatalized and realized phonetically as [tɕi], approximately chi; however, now /ti/ and /tɕi/ are distinct, as evidenced by words like tī "Western style tea" and chii "social status."
The 'r' of the Japanese language (technically a lateral apical postalveolar flap), is of particular interest, sounding to most English speakers to be something between an 'l' and a retroflex 'r' depending on its position in a word.
The syllabic structure and the phonotactics are very simple: the only consonant clusters allowed within a syllable consist of one of a subset of the consonants plus .
These type of clusters only occur in onsets.
However, consonant clusters across syllables are allowed as long as the two consonants are a nasal followed by a homo-organic consonant.
Consonant length (gemination) is also phonemic.
Grammar
Sentence structure
Japanese word order is classified as Subject Object Verb.
However, unlike many Indo-European languages, Japanese sentences only require that verbs come last for intelligibility.
This is because the Japanese sentence elements are marked with particles that identify their grammatical functions.
The basic sentence structure is topic-comment.
For example, Kochira-wa Tanaka-san desu (こちらは田中さんです).
Kochira ("this") is the topic of the sentence, indicated by the particle -wa.
The verb is desu, a copula, commonly translated as "to be" or "it is" (though there are other verbs that can be translated as "to be").
As a phrase, Tanaka-san desu is the comment.
This sentence loosely translates to "As for this person, (it) is Mr./Mrs./Miss Tanaka."
Thus Japanese, like Chinese, Korean, and many other Asian languages, is often called a topic-prominent language, which means it has a strong tendency to indicate the topic separately from the subject, and the two do not always coincide.
The sentence Zō-wa hana-ga nagai desu (象は鼻が長いです) literally means, "As for elephants, (their) noses are long".
The topic is "elephant", and the subject is "nose".
Japanese is a pro-drop language, meaning that the subject or object of a sentence need not be stated if it is obvious from context.
In addition, it is commonly felt, particularly in spoken Japanese, that the shorter a sentence is, the better.
As a result of this grammatical permissiveness and tendency towards brevity, Japanese speakers tend naturally to omit words from sentences, rather than refer to them with pronouns.
In the context of the above example, hana-ga nagai would mean "[their] noses are long," while nagai by itself would mean "[they] are long."
A single verb can be a complete sentence:
Yatta! "[I / we / they / etc.] did [it]!".
In addition, since adjectives can form the predicate in a Japanese sentence (below), a single adjective can be a complete sentence:
Urayamashii! "[I'm] jealous [of it]!".
While the language has some words that are typically translated as pronouns, these are not used as frequently as pronouns in some Indo-European languages, and function differently.
Instead, Japanese typically relies on special verb forms and auxiliary verbs to indicate the direction of benefit of an action: "down" to indicate the out-group gives a benefit to the in-group; and "up" to indicate the in-group gives a benefit to the out-group.
Here, the in-group includes the speaker and the out-group doesn't, and their boundary depends on context.
For example, oshiete moratta (literally, "explained" with a benefit from the out-group to the in-group) means "[he/she/they] explained it to [me/us]".
Similarly, oshiete ageta (literally, "explained" with a benefit from the in-group to the out-group) means "[I/we] explained [it] to [him/her/them]".
Such beneficiary auxiliary verbs thus serve a function comparable to that of pronouns and prepositions in Indo-European languages to indicate the actor and the recipient of an action.
Japanese "pronouns" also function differently from most modern Indo-European pronouns (and more like nouns) in that they can take modifiers as any other noun may.
For instance, one cannot say in English:
*The amazed he ran down the street. (grammatically incorrect)
But one can grammatically say essentially the same thing in Japanese:
Odoroita kare-wa michi-o hashitte itta. (grammatically correct)
This is partly due to the fact that these words evolved from regular nouns, such as kimi "you" (君 "lord"), anata "you" (あなた "that side, yonder"), and boku "I" (僕 "servant").
This is why some linguists do not classify Japanese "pronouns" as pronouns, but rather as referential nouns.
Japanese personal pronouns are generally used only in situations requiring special emphasis as to who is doing what to whom.
The choice of words used as pronouns is correlated with the sex of the speaker and the social situation in which they are spoken: men and women alike in a formal situation generally refer to themselves as watashi (私 "private") or watakushi (also 私), while men in rougher or intimate conversation are much more likely to use the word ore (俺 "oneself", "myself") or boku.
Similarly, different words such as anata, kimi, and omae (お前, more formally 御前 "the one before me") may be used to refer to a listener depending on the listener's relative social position and the degree of familiarity between the speaker and the listener.
When used in different social relationships, the same word may have positive (intimate or respectful) or negative (distant or disrespectful) connotations.
Japanese speakers often use the title of the person referred to where a pronoun would be used in English.
For example, when speaking to one's teacher, it is appropriate to use sensei (先生, teacher), but inappropriate to use anata.
This is because anata is used to refer to people of equal or lower status, and one's teacher has higher status.
For English-speaking learners of Japanese, a frequent beginner's mistake is to include watashi-wa or anata-wa at the beginning of sentences, as one would with I or you in English.
Though these sentences are not grammatically incorrect, even in formal settings they would be considered unnatural, much like repeatedly using a noun in English where a pronoun would suffice.
Inflection and conjugation
Japanese nouns have no grammatical number or gender, and the language has no articles.
The noun hon (本) may refer to a single book or several books; hito (人) can mean "person" or "people"; and ki (木) can be "tree" or "trees".
Where number is important, it can be indicated by providing a quantity (often with a counter word) or (rarely) by adding a suffix.
Words for people are usually understood as singular.
Thus Tanaka-san usually means Mr./Mrs./Miss Tanaka.
Words that refer to people and animals can be made to indicate a group of individuals through the addition of a collective suffix (a noun suffix that indicates a group), such as -tachi, but this is not a true plural: the meaning is closer to the English phrase "and company".
A group described as Tanaka-san-tachi may include people not named Tanaka.
Some Japanese nouns are effectively plural, such as hitobito "people" and wareware "we/us", while the word tomodachi "friend" is considered singular, although plural in form.
Verbs are conjugated to show tenses, of which there are two: past and present, or non-past, which is used for the present and the future.
For verbs that represent an ongoing process, the -te iru form indicates a continuous (or progressive) tense.
For others that represent a change of state, the -te iru form indicates a perfect tense.
For example, kite iru means "He has come (and is still here)", but tabete iru means "He is eating".
Questions (both with an interrogative pronoun and yes/no questions) have the same structure as affirmative sentences, but with intonation rising at the end.
In the formal register, the question particle -ka is added.
For example, Ii desu (いいです) "It is OK" becomes Ii desu-ka (いいですか) "Is it OK?".
In a more informal tone sometimes the particle -no (の) is added instead to show a personal interest of the speaker:
Dōshite konai-no? "Why aren't (you) coming?".
Some simple queries are formed simply by mentioning the topic with an interrogative intonation to call for the hearer's attention:
Kore-wa? "(What about) this?"; Namae-wa? (名前は?) "(What's your) name?".
Negatives are formed by inflecting the verb.
For example, Pan-o taberu (パンを食べる) "I will eat bread" or "I eat bread" becomes Pan-o tabenai (パンを食べない) "I will not eat bread" or "I do not eat bread".
The so-called -te verb form is used for a variety of purposes: either progressive or perfect aspect (see above); combining verbs in a temporal sequence (Asagohan-o tabete sugu dekakeru "I'll eat breakfast and leave at once"), simple commands, conditional statements and permissions (Dekakete-mo ii? "May I go out?"), etc.
The word da (plain), desu (polite) is the copula verb.
It corresponds approximately to the English be, but often takes on other roles, including a marker for tense, when the verb is conjugated into its past form datta (plain), deshita (polite).
This comes into use because only adjectives and verbs can carry tense in Japanese.
Two additional common verbs are used to indicate existence ("there is") or, in some contexts, property: aru (negative nai) and iru (negative inai), for inanimate and animate things, respectively.
For example, Neko ga iru "There's a cat", Ii kangae-ga nai "[I] haven't got a good idea".
Note that the negative forms of the verbs iru and aru are actually i-adjectives and inflect as such, e.g. Neko ga inakatta "There was no cat".
The verb "to do" (suru, polite form shimasu) is often used to make verbs from nouns (ryōri suru "to cook", benkyō suru "to study", etc.) and has been productive in creating modern slang words.
Japanese also has a huge number of compound verbs to express concepts that are described in English using a verb and a preposition (e.g. tobidasu "to fly out, to flee," from tobu "to fly, to jump" + dasu "to put out, to emit").
There are three types of adjective (see also Japanese adjectives):
keiyōshi, or i adjectives, which have a conjugating ending i (such as atsui "to be hot") which can become past (atsukatta "it was hot"), or negative (atsuku nai "it is not hot").
Note that nai is also an i adjective, which can become past (atsuku nakatta "it was not hot").
atsui hi "a hot day".
keiyōdōshi, or na adjectives, which are followed by a form of the copula, usually na.
For example hen (strange)
hen na hito "a strange person".
rentaishi, also called true adjectives, such as ano "that"
ano yama "that mountain".
Both keiyōshi and keiyōdōshi may predicate sentences.
For example,
Gohan-ga atsui. "The rice is hot."
Kare-wa hen da. "He's strange."
Both inflect, though they do not show the full range of conjugation found in true verbs.
The rentaishi in Modern Japanese are few in number, and unlike the other words, are limited to directly modifying nouns.
They never predicate sentences.
Examples include ookina "big", kono "this", iwayuru "so-called" and taishita "amazing".
Both keiyōdōshi and keiyōshi form adverbs, by following with ni in the case of keiyōdōshi:
hen ni naru "become strange",
and by changing i to ku in the case of keiyōshi:
atsuku naru "become hot".
The grammatical function of nouns is indicated by postpositions, also called particles.
These include for example:
ga for the nominative case.
Not necessarily a subject.
Kare ga yatta. "He did it."
ni for the dative case.
Tanaka-san ni agete kudasai. "Please give it to Mr. Tanaka."
It is also used for the lative case, indicating a motion to a location.
Nihon ni ikitai. "I want to go to Japan."
no for the genitive case, or nominalizing phrases.
watashi no kamera "my camera"
Sukī-ni iku no ga suki desu. "(I) like going skiing."
o for the accusative case.
Not necessarily an object.
Nani o tabemasu ka? "What will (you) eat?"
wa for the topic.
It can co-exist with case markers above except no, and it overrides ga and o.
Watashi wa tai-ryōri ga ii desu. "As for me, Thai food is good."
The nominative marker ga after watashi is hidden under wa.
(Note that English generally makes no distinction between sentence topic and subject.)
Note: The difference between wa and ga goes beyond the English distinction between sentence topic and subject.
While wa indicates the topic, which the rest of the sentence describes or acts upon, it carries the implication that the subject indicated by wa is not unique, or may be part of a larger group.
Ikeda-san wa yonjū-ni sai da. "As for Mr. Ikeda, he is forty-two years old."
Others in the group may also be of that age.
Absence of wa often means the subject is the focus of the sentence.
Ikeda-san ga yonjū-ni sai da. "It is Mr. Ikeda who is forty-two years old."
This is a reply to an implicit or explicit question who in this group is forty-two years old.
Politeness
Unlike most western languages, Japanese has an extensive grammatical system to express politeness and formality.
Most relationships are not equal in Japanese society.
The differences in social position are determined by a variety of factors including job, age, experience, or even psychological state (e.g., a person asking a favour tends to do so politely).
The person in the lower position is expected to use a polite form of speech, whereas the other might use a more plain form.
Strangers will also speak to each other politely.
Japanese children rarely use polite speech until they are teens, at which point they are expected to begin speaking in a more adult manner.
See uchi-soto.
Whereas teineigo (丁寧語) (polite language) is commonly an inflectional system, sonkeigo (尊敬語) (respectful language) and kenjōgo (謙譲語) (humble language) often employ many special honorific and humble alternate verbs: iku "go" becomes ikimasu in polite form, but is replaced by irassharu in honorific speech and ukagau or mairu in humble speech.
The difference between honorific and humble speech is particularly pronounced in the Japanese language.
Humble language is used to talk about oneself or one's own group (company, family) whilst honorific language is mostly used when describing the interlocutor and his/her group.
For example, the -san suffix ("Mr.", "Mrs." or "Miss") is an example of honorific language.
It is not used to talk about oneself or when talking about someone from one's company to an external person, since the company is the speaker's "group".
When speaking directly to one's superior in one's company or when speaking with other employees within one's company about a superior, a Japanese person will use vocabulary and inflections of the honorific register to refer to the in-group superior and his or her speech and actions.
When speaking to a person from another company (i.e., a member of an out-group), however, a Japanese person will use the plain or the humble register to refer to the speech and actions of his or her own in-group superiors.
In short, the register used in Japanese to refer to the person, speech, or actions of any particular individual varies depending on the relationship (either in-group or out-group) between the speaker and listener, as well as depending on the relative status of the speaker, listener, and third-person referents.
For this reason, the Japanese system for explicit indication of social register is known as a system of "relative honorifics."
This stands in stark contrast to the Korean system of "absolute honorifics," in which the same register is used to refer to a particular individual (e.g. one's father, one's company president, etc.) in any context regardless of the relationship between the speaker and interlocutor.
Thus, polite Korean speech can sound very presumptuous when translated verbatim into Japanese, as in Korean it is acceptable and normal to say things like "Our Mr. Company-President..." when communicating with a member of an out-group, which would be very inappropriate in a Japanese social context.
Most nouns in the Japanese language may be made polite by the addition of o- or go- as a prefix.
o- is generally used for words of native Japanese origin, whereas go- is affixed to words of Chinese derivation.
In some cases, the prefix has become a fixed part of the word, and is included even in regular speech, such as gohan 'cooked rice; meal.'
Such a construction often indicates deference to either the item's owner or to the object itself.
For example, the word tomodachi 'friend,' would become o-tomodachi when referring to the friend of someone of higher status (though mothers often use this form to refer to their children's friends).
On the other hand, a polite speaker may sometimes refer to mizu 'water' as o-mizu in order to show politeness.
Most Japanese people employ politeness to indicate a lack of familiarity.
That is, they use polite forms for new acquaintances, but if a relationship becomes more intimate, they no longer use them.
This occurs regardless of age, social class, or gender.
Vocabulary
The original language of Japan, or at least the original language of a certain population that was ancestral to a significant portion of the historical and present Japanese nation, was the so-called yamato kotoba (大和言葉 or infrequently 大和詞, i.e. "Yamato words"), which in scholarly contexts is sometimes referred to as wago (和語 or rarely 倭語, i.e. the "Wa words").
In addition to words from this original language, present-day Japanese includes a great number of words that were either borrowed from Chinese or constructed from Chinese roots following Chinese patterns.
These words, known as kango (漢語), entered the language from the fifth century onwards via contact with Chinese culture.
According to the Japanese dictionary Shinsen-kokugojiten (新選国語辞典), Chinese-based words comprise 49.1% of the total vocabulary, wago 33.8%, and other foreign words 8.8%.
Like Latin-derived words in English, kango words typically are perceived as somewhat formal or academic compared to equivalent Yamato words.
Indeed, it is generally fair to say that an English word derived from Latin/French roots typically corresponds to a Sino-Japanese word in Japanese, whereas a simpler Anglo-Saxon word would best be translated by a Yamato equivalent.
A much smaller number of words has been borrowed from Korean and Ainu.
Japanese has also borrowed a number of words from other languages, particularly ones of European extraction, which are called gairaigo.
This began with borrowings from Portuguese in the 16th century, followed by borrowing from Dutch during Japan's long isolation of the Edo period.
With the Meiji Restoration and the reopening of Japan in the 19th century, borrowing occurred from German, French and English.
Currently, words of English origin are the most commonly borrowed.
In the Meiji era, the Japanese also coined many neologisms using Chinese roots and morphology to translate Western concepts.
Many of these pseudo-Chinese words were imported into Chinese, Korean, and Vietnamese via their kanji in the late 19th and early 20th centuries.
For example, seiji (政治, "politics") and kagaku (化学, "chemistry") are words derived from Chinese roots that were first created and used by the Japanese, and only later borrowed into Chinese and other East Asian languages.
As a result, Japanese, Chinese, Korean, and Vietnamese share a large common corpus of vocabulary in the same way a large number of Greek- and Latin-derived words are shared among modern European languages, although many academic words formed from such roots were certainly coined by native speakers of other languages, such as English.
In the past few decades, wasei-eigo (made-in-Japan English) has become a prominent phenomenon.
Words such as wanpatān (< one + pattern, "to be in a rut", "to have a one-track mind") and sukinshippu (< skin + -ship, "physical contact"), although coined by compounding English roots, are nonsensical in most non-Japanese contexts; exceptions exist in nearby languages such as Korean, however, which often use words such as skinship and rimokon (remote control) in the same way as in Japanese.
Additionally, many native Japanese words have become commonplace in English, due to the popularity of many Japanese cultural exports.
Words such as futon, haiku, judo, kamikaze, karaoke, karate, ninja, origami, rickshaw (from jinrikisha), samurai, sayonara, sumo, sushi, tsunami, tycoon and many others have become part of the English language.
See list of English words of Japanese origin for more.
Writing system
Literacy was introduced to Japan in the form of the Chinese writing system, by way of Baekje before the 5th century.
Using this language, the Japanese emperor Yūryaku sent a letter to the Chinese Emperor Shun of Liu Song in 478 CE.
After the ruin of Baekje, Japan invited scholars from China to learn more of the Chinese writing system.
Japanese Emperors gave an official rank to Chinese scholars (続守言/薩弘格/袁晋卿) and spread the use of Chinese characters from the 7th century to the 8th century.
At first, the Japanese wrote in Classical Chinese, with Japanese names represented by characters used for their meanings and not their sounds.
Later, during the seventh century CE, the Chinese-sounding phoneme principle was used to write pure Japanese poetry and prose (comparable to Akkadian's retention of Sumerian cuneiform), but some Japanese words were still written with characters for their meaning and not the original Chinese sound.
This is when the history of Japanese as a written language begins in its own right.
By this time, the Japanese language was already distinct from the Ryukyuan languages.
The Korean settlers and their descendants used Kudara-on or Baekje pronunciation (百済音), which was also called Tsushima-pronunciation (対馬音) or Go-on (呉音).
An example of this mixed style is the Kojiki, which was written in 712 AD.
They then started to use Chinese characters to write Japanese in a style known as man'yōgana, a syllabic script which used Chinese characters for their sounds in order to transcribe the words of Japanese speech syllable by syllable.
Over time, a writing system evolved.
Chinese characters (kanji) were used to write either words borrowed from Chinese, or Japanese words with the same or similar meanings.
Chinese characters were also used to write grammatical elements; these characters were simplified, and eventually became two syllabic scripts: hiragana and katakana.
Modern Japanese is written in a mixture of three main systems: kanji, characters of Chinese origin used to represent both Chinese loanwords into Japanese and a number of native Japanese morphemes; and two syllabaries: hiragana and katakana.
The Latin alphabet is also sometimes used.
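As a rough illustration of this mixture of scripts, the following Java sketch counts the characters of a sentence by script, using the standard Character.UnicodeScript API. The class name ScriptCount, the method countScripts, and the sample sentence (Tanaka-san wa terebi de DVD o mita, "Mr. Tanaka watched a DVD on television", written with Unicode escapes so that the source stays ASCII-only) are assumptions made for this example and do not come from the article.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not from the article): counts the characters of a
// string by writing system, based on each code point's Unicode script.
class ScriptCount {
    static Map<String, Integer> countScripts(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        // Iterate by code point so characters outside the BMP are handled too.
        text.codePoints().forEach(cp -> {
            Character.UnicodeScript script = Character.UnicodeScript.of(cp);
            String label;
            switch (script) {
                case HAN:      label = "kanji";    break; // characters of Chinese origin
                case HIRAGANA: label = "hiragana"; break;
                case KATAKANA: label = "katakana"; break;
                case LATIN:    label = "latin";    break;
                default:       label = "other";    break; // punctuation, digits, etc.
            }
            counts.merge(label, 1, Integer::sum);
        });
        return counts;
    }

    public static void main(String[] args) {
        // Illustrative sample: "Tanaka-san wa terebi de DVD o mita"
        String sample = "\u7530\u4E2D\u3055\u3093\u306F"   // Tanaka-san wa
                      + "\u30C6\u30EC\u30D3\u3067"         // terebi de
                      + "DVD\u3092\u898B\u305F";           // DVD o mita
        System.out.println(countScripts(sample));
    }
}
```

For this sample the sketch should report 3 kanji, 6 hiragana, 3 katakana and 3 Latin letters. Punctuation, digits and similar shared characters belong to none of these four scripts and are counted as "other".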
Arabic numerals are much more common than kanji numerals in counting, but kanji numerals are still used in compounds, such as tōitsu (統一, "unification").
Hiragana are used for words without kanji representation, for words no longer written in kanji, and also following kanji to show conjugational endings.
Because of the way verbs (and adjectives) in Japanese are conjugated, kanji alone cannot fully convey Japanese tense and mood, as kanji cannot be subject to variation when written without losing its meaning.
For this reason, hiragana are suffixed to the ends of kanji to show verb and adjective conjugations.
Hiragana used in this way are called okurigana.
Hiragana are also written in a superscript called furigana above or beside a kanji to show the proper reading.
This is done to facilitate learning, as well as to clarify particularly old or obscure (or sometimes invented) readings.
Katakana, like hiragana, are a syllabary; katakana are primarily used to write foreign words, plant and animal names, and for emphasis.
For example "Australia" has been adapted as Ōsutoraria (オーストラリア), and "supermarket" has been adapted and shortened into sūpā (スーパー).
The Latin alphabet (in Japanese referred to as Rōmaji (ローマ字), literally "Roman letters") is used for some loan words like "CD" and "DVD", and also for some Japanese creations like "Sony".
Historically, attempts to limit the number of kanji in use commenced in the mid-19th century, but did not become a matter of government intervention until after Japan's defeat in the Second World War.
During the period of post-war occupation (and influenced by the views of some U.S. officials), various schemes including the complete abolition of kanji and exclusive use of rōmaji were considered.
The jōyō kanji ("common use kanji", originally called tōyō kanji [kanji for general use]) scheme arose as a compromise solution.
Japanese students begin to learn kanji from their first year at elementary school.
A guideline created by the Japanese Ministry of Education, the list of kyōiku kanji ("education kanji", a subset of jōyō kanji), specifies the 1,006 simple characters a child is to learn by the end of sixth grade.
Children continue to study another 939 characters in junior high school, covering in total 1,945 jōyō kanji.
The official list of jōyō kanji was revised several times, but the total number of officially sanctioned characters remained largely unchanged.
As for kanji for personal names, the circumstances are somewhat complicated.
Jōyō kanji and jinmeiyō kanji (an appendix of additional characters for names) are approved for registering personal names.
Names containing unapproved characters are denied registration.
However, as with the list of jōyō kanji, criteria for inclusion were often arbitrary and led to many common and popular characters being disapproved for use.
Under popular pressure and following a court decision holding the exclusion of common characters unlawful, the list of jinmeiyō kanji was substantially extended from 92 in 1951 (the year it was first decreed) to 983 in 2004.
Furthermore, families whose names are not on these lists were permitted to continue using the older forms.
Many writers rely on newspaper circulation to publish their work with officially sanctioned characters.
This distribution method is more efficient than traditional pen and paper publications.
Study by non-native speakers
Many major universities throughout the world provide Japanese language courses, and a number of secondary and even primary schools worldwide offer courses in the language.
International interest in the Japanese language dates from the 1800s but has become more prevalent following Japan's economic bubble of the 1980s and the global popularity of Japanese pop culture (such as anime and video games) since the 1990s.
About 2.3 million people studied the language worldwide in 2003, including 900,000 South Koreans, 389,000 Chinese, 381,000 Australians, and 140,000 Americans in lower and higher educational institutions.
In Japan, more than 90,000 foreign students study at Japanese universities and Japanese language schools, including 77,000 Chinese and 15,000 South Koreans in 2003.
In addition, local governments and some NPO groups provide free Japanese language classes for foreign residents, including Japanese Brazilians and foreigners married to Japanese nationals.
In the United Kingdom, studies are supported by the British Association for Japanese Studies.
In Ireland, Japanese is offered as a language in the Leaving Certificate in some schools.
The Japanese government provides standardised tests to measure spoken and written comprehension of Japanese for second language learners; the most prominent is the Japanese Language Proficiency Test (JLPT).
The Japan External Trade Organization (JETRO) organises the Business Japanese Proficiency Test, which tests the learner's ability to understand Japanese in a business setting.
When learning Japanese in a college setting, students are usually first taught how to pronounce romaji.
From that point, they are taught the two main syllabaries, with kanji usually being introduced in the second semester.
Focus is usually first on polite (distal) speech, which students who might interact with native speakers would be expected to use.
Casual speech and formal speech usually follow polite speech, as does the usage of honorifics.
Java (programming language)
Java is a programming language originally developed by Sun Microsystems and released in 1995 as a core component of Sun Microsystems' Java platform.
The language derives much of its syntax from C and C++ but has a simpler object model and fewer low-level facilities.
Java applications are typically compiled to bytecode that can run on any Java virtual machine (JVM) regardless of computer architecture.
The original and reference implementation Java compilers, virtual machines, and class libraries were developed by Sun from 1995.
As of May 2007, in compliance with the specifications of the Java Community Process, Sun made available most of their Java technologies as free software under the GNU General Public License.
Others have also developed alternative implementations of these Sun technologies, such as the GNU Compiler for Java and GNU Classpath.
History
The Java language was created by James Gosling in June 1991 for use in one of his many set-top box projects.
The language was initially called Oak, after an oak tree that stood outside Gosling's office; it also went by the name Green, and was later renamed Java, a name chosen from a list of random words.
Gosling's goals were to implement a virtual machine and a language that had a familiar C/C++ style of notation.
The first public implementation was Java 1.0 in 1995.
It promised "Write Once, Run Anywhere" (WORA), providing no-cost runtimes on popular platforms.
It was fairly secure and its security was configurable, allowing network and file access to be restricted.
Major web browsers soon incorporated the ability to run secure Java applets within web pages.
Java quickly became popular.
With the advent of Java 2, new versions had multiple configurations built for different types of platforms.
For example, J2EE was for enterprise applications and the greatly stripped down version J2ME was for mobile applications.
J2SE was the designation for the Standard Edition.
In 2006, for marketing purposes, new J2 versions were renamed Java EE, Java ME, and Java SE, respectively.
In 1997, Sun Microsystems approached the ISO/IEC JTC1 standards body and later Ecma International to formalize Java, but it soon withdrew from the process.
Java remains a de facto standard that is controlled through the Java Community Process.
At one time, Sun made most of its Java implementations available without charge although they were proprietary software.
Sun's revenue from Java was generated by the selling of licenses for specialized products such as the Java Enterprise System.
Sun distinguishes between its Software Development Kit (SDK) and its Runtime Environment (JRE), which is a subset of the SDK; the primary distinction is that the JRE lacks the compiler, utility programs, and header files.
On 13 November 2006, Sun released much of Java as free and open-source software under the terms of the GNU General Public License (GPL).
On 8 May 2007 Sun finished the process, making all of Java's core code free and open-source, aside from a small portion of code to which Sun did not hold the copyright.
Philosophy
Primary goals
There were five primary goals in the creation of the Java language:
It should use the object-oriented programming methodology.
It should allow the same program to be executed on multiple operating systems.
It should contain built-in support for using computer networks.
It should be designed to execute code from remote sources securely.
It should be easy to use by selecting what were considered the good parts of other object-oriented languages.
Platform independence
One characteristic, platform independence, means that programs written in the Java language must run similarly on any supported hardware/operating-system platform.
One should be able to write a program once, compile it once, and run it anywhere.
Most Java compilers achieve this by compiling the Java language code "halfway", to Java bytecode: simplified machine instructions specific to the Java platform.
The code is then run on a virtual machine (VM), a program written in native code on the host hardware that interprets and executes generic Java bytecode.
(In some JVM versions, bytecode can also be compiled to native code, either before or during program execution, resulting in faster execution.)
Further, standardized libraries are provided to allow access to features of the host machines (such as graphics, threading and networking) in unified ways.
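The idea can be sketched with a small program that relies only on the standard library for anything host-specific. The class name PlatformInfo, the application directory "demo" and the file name "settings.properties" are hypothetical and chosen only for illustration; the same compiled class file should run unchanged on any conforming JVM while printing different values on each host.

```java
import java.io.File;
import java.nio.file.Path;
import java.nio.file.Paths;

// Illustrative sketch: host-specific details are obtained through the
// standard library, so the source and bytecode contain no platform checks.
class PlatformInfo {
    // Builds a path without hard-coding "/" or "\" as the separator.
    static Path configPath(String home, String app) {
        return Paths.get(home, app, "settings.properties");
    }

    // Describes the host the JVM is currently running on.
    static String describeHost() {
        return System.getProperty("os.name") + " / "
             + System.getProperty("os.arch") + " / Java "
             + System.getProperty("java.version");
    }

    public static void main(String[] args) {
        System.out.println(describeHost());
        System.out.println(configPath(System.getProperty("user.home"), "demo"));
        System.out.println("separator: " + File.separator);
    }
}
```

Compiling this once with javac and copying the resulting class file to machines running different operating systems is one simple way to observe the behaviour described above.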
Note that, although there is an explicit compiling stage, at some point, the Java bytecode is interpreted or converted to native machine code by the JIT compiler.
The first implementations of the language used an interpreted virtual machine to achieve portability.
These implementations produced programs that ran more slowly than programs compiled to native executables, such as those written in C or C++, so the language suffered a reputation for poor performance.
More recent JVM implementations produce programs that run significantly faster than before, using multiple techniques.
One technique, known as just-in-time compilation (JIT), translates the Java bytecode into native code at the time that the program is run, which results in a program that executes faster than interpreted code but also incurs compilation overhead during execution.
More sophisticated VMs use dynamic recompilation, in which the VM can analyze the behavior of the running program and selectively recompile and optimize critical parts of the program.
Dynamic recompilation can achieve optimizations superior to static compilation because the dynamic compiler can base optimizations on knowledge about the runtime environment and the set of loaded classes, and can identify the hot spots (parts of the program, often inner loops, that take up the most execution time).
JIT compilation and dynamic recompilation allow Java programs to take advantage of the speed of native code without losing portability.
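The effect of these techniques can sometimes be observed with a rough timing sketch like the one below, which runs the same hot loop for several rounds. The class name WarmupDemo, the method sumOfSquares and the loop sizes are assumptions made for this example; it is not a rigorous benchmark, and the timings depend entirely on the VM and the machine. On a VM with a JIT compiler the later rounds are often faster than the first; on Sun's HotSpot VM the -Xint option forces interpreter-only execution for comparison.

```java
// Rough timing sketch, not a rigorous benchmark: the same hot loop is run
// in several rounds so that a JIT-enabled VM has a chance to compile it.
class WarmupDemo {
    // A small CPU-bound method that becomes a "hot spot" when called repeatedly.
    static long sumOfSquares(int n) {
        long total = 0;
        for (int i = 1; i <= n; i++) {
            total += (long) i * i;
        }
        return total;
    }

    public static void main(String[] args) {
        for (int round = 1; round <= 5; round++) {
            long start = System.nanoTime();
            long checksum = 0;
            for (int rep = 0; rep < 2_000; rep++) {
                checksum += sumOfSquares(10_000);
            }
            long micros = (System.nanoTime() - start) / 1_000;
            // The checksum is printed so the work cannot be optimized away entirely.
            System.out.println("round " + round + ": " + micros
                    + " us (checksum " + checksum + ")");
        }
    }
}
```

Printing the checksum keeps the computed result in use; without it, an optimizing VM could in principle discard the loop and make the measurement meaningless.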
Another technique, commonly known as static compilation, is to compile directly into native code like a more traditional compiler.
Static Java compilers, such as GCJ, translate the Java language code to native object code, removing the intermediate bytecode stage.
This achieves good performance compared to interpretation, but at the expense of portability; the output of these compilers can only be run on a single architecture.
Some see avoiding the VM in this manner as defeating the point of developing in Java; however, it can be useful to provide both a generic bytecode version and an optimised native code version of an application.
Implementations
Sun Microsystems officially licenses the Java Standard Edition platform for Microsoft Windows, Linux, and Solaris.
Through a network of third-party vendors and licensees, alternative Java environments are available for these and other platforms.
To qualify as a certified Java licensee, an implementation on any particular platform must pass a rigorous suite of validation and compatibility tests.
This process provides a guaranteed level of compliance across platforms through a trusted set of commercial and non-commercial partners.
Sun's trademark license for usage of the Java brand insists that all implementations be "compatible".
This resulted in a legal dispute with Microsoft after Sun claimed that the Microsoft implementation did not support the RMI and JNI interfaces and had added platform-specific features of their own.
Sun sued in 1997, and in 2001 won a settlement of $20 million as well as a court order enforcing the terms of the license from Sun.
As a result, Microsoft no longer ships Java with Windows, and in recent versions of Windows, Internet Explorer cannot support Java applets without a third-party plugin.
However, Sun and others have made available Java run-time systems at no cost for those and other versions of Windows.
Platform-independent Java is essential to the Java Enterprise Edition strategy, and an even more rigorous validation is required to certify an implementation.
This environment enables portable server-side applications, such as Web services, servlets, and Enterprise JavaBeans, as well as embedded systems based on OSGi that use Embedded Java environments.
Through the new GlassFish project, Sun is working to create a fully functional, unified open-source implementation of the Java EE technologies.
Automatic memory management
One of the ideas behind Java's automatic memory management model is that programmers be spared the burden of having to perform manual memory management.
In some languages the programmer allocates memory for the creation of objects stored on the heap and the responsibility of later deallocating that memory also resides with the programmer.
If the programmer forgets to deallocate memory or writes code that fails to do so, a memory leak occurs and the program can consume an arbitrarily large amount of memory.
Additionally, if the program attempts to deallocate the region of memory more than once, the result is undefined and the program may become unstable and may crash.
Finally, in non-garbage-collected environments, user code carries a certain degree of overhead and complexity to track and finalize allocations.
Often developers may box themselves into certain designs to provide reasonable assurances that memory leaks will not occur.
In Java, this potential problem is avoided by automatic garbage collection.
The programmer determines when objects are created, and the Java runtime is responsible for managing the object's lifecycle.
The program or other objects can reference an object by holding a reference to it (which, from a low-level point of view, is its address on the heap).
When no references to an object remain, the unreachable object is eligible for release by the Java garbage collector; it may be freed automatically at any time.
Memory leaks may still occur if a programmer's code holds a reference to an object that is no longer needed; in other words, they can still occur, but at higher conceptual levels.
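As a rough illustration of such a higher-level leak, the sketch below keeps every buffer it has handled in a long-lived static list. The class and method names are hypothetical and chosen only for this example. Each buffer stays reachable through the list, so the garbage collector cannot reclaim it even though the program no longer needs it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical example class; names are illustrative only.
class LeakyCache {
    // A long-lived collection: anything added here stays reachable.
    private static final List<byte[]> processed = new ArrayList<>();

    static void handle(int size) {
        byte[] buffer = new byte[size]; // allocated on the heap
        // ... the buffer would be used here ...
        processed.add(buffer); // lingering reference keeps the buffer alive
    }

    static int retainedBuffers() {
        return processed.size();
    }

    static void release() {
        // Dropping the references makes the buffers eligible for collection.
        processed.clear();
    }
}
```

Calling release() does not free memory directly; it only removes the last references, after which the collector may reclaim the buffers whenever it next runs.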
The use of garbage collection in a language can also affect programming paradigms.
If, for example, the developer assumes that the cost of memory allocation/recollection is low, they may choose to more freely construct objects instead of pre-initializing, holding and reusing them.
At the small cost of potential performance penalties (inner-loop construction of large or complex objects), this facilitates thread isolation (no need to synchronize, as different threads work on different object instances) and data hiding.
The use of transient immutable value-objects minimizes side-effect programming.
Comparing Java and C++, it is possible in C++ to implement similar functionality (for example, a memory management model for specific classes can be designed in C++ to improve speed and lower memory fragmentation considerably), with the possible cost of adding comparable runtime overhead to that of Java's garbage collector, and of added development time and application complexity if one favors manual implementation over using an existing third-party library.
In Java, garbage collection is built-in and virtually invisible to the developer.
That is, developers may have no notion of when garbage collection will take place as it may not necessarily correlate with any actions being explicitly performed by the code they write.
Depending on intended application, this can be beneficial or disadvantageous: the programmer is freed from performing low-level tasks, but at the same time loses the option of writing lower level code.
Additionally, garbage collection demands some attention to tuning the JVM, as large heaps can cause apparently random stalls in performance.
Java does not support pointer arithmetic as is supported in, for example, C++.
This is because the garbage collector may relocate referenced objects, invalidating such pointers.
Another reason that Java forbids this is that type safety and security can no longer be guaranteed if arbitrary manipulation of pointers is allowed.
Syntax
The syntax of Java is largely derived from C++.
Unlike C++, which combines the syntax for structured, generic, and object-oriented programming, Java was built exclusively as an object oriented language.
As a result, almost everything is an object and all code is written inside a class.
The exceptions are the intrinsic data types (ordinal and real numbers, boolean values, and characters), which are not classes for performance reasons.
Hello, world program
A minimal Hello world program in Java consists of a single public class, Hello, whose main method prints a greeting.
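A sketch consistent with the explanation in this section (a public class named Hello stored in Hello.java) might look as follows; the exact greeting text is an assumption.

```java
// Hello.java
public class Hello {
    // Entry point called by the Java launcher.
    public static void main(String[] args) {
        System.out.println("Hello, world!");
    }
}
```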
To execute a Java program, the code is saved as a file named Hello.java.
It must first be compiled into bytecode using a Java compiler, which produces a file named Hello.class.
This class is then launched.
The above example merits a bit of explanation.
All executable statements in Java are written inside a class, including stand-alone programs.
Source files are by convention named the same as the class they contain, appending the mandatory suffix .java.
A class that is declared public is required to follow this convention.
(In this case, the class Hello is public, therefore the source must be stored in a file called Hello.java.)
The compiler will generate a class file for each class defined in the source file.
The name of the class file is the name of the class, with .class appended.
For class file generation, anonymous classes are treated as if their name was the concatenation of the name of their enclosing class, a $, and an integer.
The keyword public denotes that a method can be called from code in other classes, or that a class may be used by classes outside the class hierarchy.
The keyword static indicates that the method is a static method, associated with the class rather than object instances.
The keyword void indicates that the main method does not return any value to the caller.
The method name "main" is not a keyword in the Java language.
It is simply the name of the method the Java launcher calls to pass control to the program.
Java classes that run in managed environments such as applets and Enterprise Java Beans do not use or need a main() method.
The main method must accept an array of String objects.
By convention, it is referenced as args, although any other legal identifier name can be used.
Since Java 5, the main method can also use variable arguments, in the form of public static void main(String... args), allowing the main method to be invoked with an arbitrary number of String arguments.
The effect of this alternate declaration is semantically identical (the args parameter is still an array of String objects), but allows an alternate syntax for creating and passing the array.
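The equivalence between the two declarations can be sketched with an ordinary varargs method. The class name VarargsDemo and the helper count are hypothetical; inside the method, the parameter is still a String array whichever call syntax is used.

```java
// Hypothetical class name; shows that String... is still a String[] inside the method.
class VarargsDemo {
    static int count(String... args) {
        return args.length; // args is an ordinary String[]
    }

    public static void main(String... args) {
        System.out.println(count());                        // no arguments: prints 0
        System.out.println(count("a", "b", "c"));           // listed individually: prints 3
        System.out.println(count(new String[] {"x", "y"})); // explicit array: prints 2
    }
}
```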
The Java launcher launches Java by loading a given class (specified on the command line) and starting its public static void main(String[]) method.
Stand-alone programs must declare this method explicitly.
The String[] args parameter is an array of String objects containing any arguments passed to the class.
The parameters to main are often passed by means of a command line.
The printing facility is part of the Java standard library: the System class defines a public static field called out.
The out object is an instance of the PrintStream class and provides the println(String) method for displaying data to the screen while creating a new line (standard out).
A more comprehensive example
The import statement imports the dialog class used by the example from its package.
The example class declares a single private field of type int.
Every instance of the class has its own copy of this field.
The private declaration means that no other class can access (read or write) the field.
The class also declares a constructor.
Constructors have the same name as the enclosing class they are declared in, and unlike a method, have no return type.
A constructor is used to initialize an object that is a newly created instance of the class.
The dialog returns a String that is converted to an int by the Integer.parseInt method.
The calculate method is declared without the static keyword.
This means that the method is invoked using a specific instance of the class.
(The reference used to invoke the method is passed as an undeclared parameter, of the class's own type, named this.)
The method tests an expression using the if keyword to see if the remainder of dividing the field belonging to the instance of the class by two is zero.
If this expression is true, then it prints Even; if this expression is false it prints Odd.
(The field can be equivalently accessed through this, which explicitly uses the undeclared parameter.)
The main method declares a local object reference variable.
This variable can hold a reference to an object of the example class.
The declaration initializes the variable by first creating an instance of the class, using the new keyword and the class's constructor, and then assigning this instance to the variable.
The next statement calls the calculate method.
The instance referenced by the local variable is used to invoke the method and is passed as the undeclared this parameter to the calculate method.
For simplicity, error handling has been ignored in this example.
Entering a value that is not a number will cause the program to crash.
This can be avoided by catching and handling the NumberFormatException thrown by Integer.parseInt.
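One way to handle the error is sketched below. The class ParityCheck, the method describe and the message text are hypothetical; the sketch takes the input as a String argument rather than from a dialog so that it can run without a graphical environment.

```java
// Hypothetical helper class; the name and messages are illustrative only.
class ParityCheck {
    // Returns "Even" or "Odd" for numeric input, or a message instead of crashing.
    static String describe(String text) {
        try {
            int value = Integer.parseInt(text.trim());
            return (value % 2 == 0) ? "Even" : "Odd";
        } catch (NumberFormatException e) {
            // Integer.parseInt throws this for any text that is not a valid int.
            return "Not a whole number: " + text;
        }
    }

    public static void main(String[] args) {
        System.out.println(describe("42"));
        System.out.println(describe("7"));
        System.out.println(describe("seven"));
    }
}
```

Catching the exception close to the conversion keeps the rest of the program free of error handling, at the cost of turning the failure into an ordinary return value that callers must remember to check.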
Applet
Java applets are programs that are embedded in other applications, typically in a Web page displayed in a Web browser.
The import statements direct the Java compiler to include the java.applet.Applet and java.awt.Graphics classes in the compilation.
The import statement allows these classes to be referenced in the source code using the simple class name (e.g. Applet) instead of the fully qualified class name (e.g. java.applet.Applet).
The example class extends (subclasses) the Applet class; the Applet class provides the framework for the host application to display and control the lifecycle of the applet.
The Applet class is an Abstract Windowing Toolkit (AWT) Component, which provides the applet with the capability to display a graphical user interface (GUI) and respond to user events.
The example class overrides the paint(Graphics) method inherited from its superclass to provide the code to display the applet.
The paint method is passed a Graphics object that contains the graphic context used to display the applet.
The paint method calls the graphic context's drawString method to display the "Hello, world!" string at a pixel offset from the upper-left corner in the applet's display.
An applet is placed in an HTML document using the applet HTML element.
The applet tag has three attributes set: code specifies the name of the applet class, and width and height set the pixel width and height of the applet.
Applets may also be embedded in HTML using either the object or embed element, although support for these elements by Web browsers is inconsistent.
However, the applet tag is deprecated, so the object tag is preferred where supported.
The host application, typically a Web browser, instantiates the applet and creates an AppletContext for the applet.
Once the applet has initialized itself, it is added to the AWT display hierarchy.
The paint method is called by the AWT event dispatching thread whenever the display needs the applet to draw itself.
Servlet
Java Servlet technology provides Web developers with a simple, consistent mechanism for extending the functionality of a Web server and for accessing existing business systems.
Servlets are server-side Java EE components that generate responses (typically HTML pages) to requests (typically HTTP requests) from clients.
A servlet can almost be thought of as an applet that runs on the server side, without a user interface.
The import statements direct the Java compiler to include all of the public classes and interfaces from the java.io and javax.servlet packages in the compilation.
The example class extends a servlet base class, which provides the interface for the server to forward requests to the servlet and control the servlet's lifecycle.
The example class overrides the service method defined by the Servlet interface to provide the code for the service request handler.
The service method is passed a request object that contains the request from the client and a response object used to create the response returned to the client.
The service method declares that it throws the exceptions ServletException and IOException if a problem prevents it from responding to the request.
The setContentType method in the response object is called to set the MIME content type of the returned data to "text/html".
The getWriter method in the response returns a PrintWriter object that is used to write the data that is sent to the client.
The println method is called to write the "Hello, world!" string to the response and then the close method is called to close the print writer, which causes the data that has been written to the stream to be returned to the client.
JavaServer Page
JavaServer Pages (JSPs) are server-side Java EE components that generate responses, typically HTML pages, to HTTP requests from clients.
JSPs embed Java code in an HTML page by using the special delimiters <% and %>.
A JSP is compiled to a Java servlet, a Java application in its own right, the first time it is accessed.
After that, the generated servlet creates the response.
Swing application
Swing is a graphical user interface library for the Java SE platform.
This example Swing application creates a single window with "Hello, world!" inside:
The first import statement directs the Java compiler to include a single named class in the compilation; the second import includes all of the public classes and interfaces from an entire package.
The example class extends the JFrame class; the JFrame class implements a window with a title bar and a close control.
The constructor initializes the frame by first calling the superclass constructor, passing a string parameter, which is used as the window's title.
It then calls the setDefaultCloseOperation method inherited from JFrame to set the default operation when the close control on the title bar is selected to DISPOSE_ON_CLOSE; this causes the JFrame to be disposed of when the frame is closed (as opposed to merely hidden), which allows the JVM to exit and the program to terminate.
Next, the layout of the frame is set to a layout manager; this tells Swing how to arrange the components that will be added to the frame.
A JLabel is created for the string "Hello, world!" and the add method inherited from a superclass is called to add the label to the frame.
The pack method inherited from a superclass is called to size the window and lay out its contents, in the manner indicated by the layout manager.
The main method is called by the JVM when the program starts.
It instantiates a new frame and causes it to be displayed by calling the setVisible method inherited from a superclass with the boolean parameter true.
Note that once the frame is displayed, exiting the main method does not cause the program to terminate because the AWT event dispatching thread remains active until all of the Swing top-level windows have been disposed.
Criticism
Java's performance has improved substantially since the early versions, and performance of JIT compilers relative to native compilers has in some tests been shown to be quite similar.
The performance of the compilers does not necessarily indicate the performance of the compiled code; only careful testing can reveal the true performance issues in any system.
The default look and feel of GUI applications written in Java using the Swing toolkit is very different from native applications.
It is possible to specify a different look and feel through the pluggable look and feel system of Swing.
Clones of Windows, GTK and Motif are supplied by Sun.
Apple also provides an Aqua look and feel for Mac OS X.
Though prior implementations of these looks and feels have been considered lacking, Swing in Java SE 6 addresses this problem by using more native widget drawing routines of the underlying platforms.
Alternatively, third party toolkits such as wx4j, Qt Jambi or SWT may be used for increased integration with the native windowing system.
As in C++ and some other object-oriented languages, variables of Java's primitive types were not originally objects.
Values of primitive types are either stored directly in fields (for objects) or on the stack (for methods) rather than on the heap, as is the common case for objects (but see Escape analysis).
This was a conscious decision by Java's designers for performance reasons.
Because of this, Java was not considered to be a pure object-oriented programming language.
However, as of Java 5.0, autoboxing lets programmers write code as if primitive types were their wrapper classes, converting freely between each primitive type and its object-oriented counterpart for improved flexibility.
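A brief sketch of autoboxing follows; the class name AutoboxingDemo and the method sum are hypothetical. The compiler inserts the conversions between int and Integer wherever the comments indicate.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical class name; illustrates compiler-inserted boxing and unboxing.
class AutoboxingDemo {
    static int sum(List<Integer> values) {
        int total = 0;
        for (Integer v : values) {
            total += v; // unboxing: Integer -> int
        }
        return total;
    }

    public static void main(String[] args) {
        List<Integer> values = new ArrayList<>();
        values.add(3); // autoboxing: int -> Integer
        values.add(4);
        Integer boxed = sum(values); // the int result is boxed on assignment
        System.out.println(boxed);   // prints 7
    }
}
```

The conversions are a compile-time convenience: collections still store Integer objects, so the primitive types themselves remain non-objects.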
Java suppresses several features (such as operator overloading and multiple inheritance) for classes in order to simplify the language, to "save the programmers from themselves", and to prevent possible errors and anti-pattern design.
This has been a source of criticism, relating to a lack of low-level features, but some of these limitations may be worked around.
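Java interfaces, however, have always supported multiple inheritance: a class may implement several interfaces, and an interface may extend several others.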
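The following sketch shows multiple inheritance of interfaces; every type name in it is hypothetical. The class has a single superclass (Object) but can be used through either interface type.

```java
// All names below are hypothetical, chosen only for illustration.
interface Swimmer {
    String swim();
}

interface Flyer {
    String fly();
}

// An interface may extend several interfaces at once.
interface Waterfowl extends Swimmer, Flyer {
}

// A class has one superclass but may implement any number of interfaces.
class Duck implements Waterfowl {
    public String swim() { return "paddling"; }
    public String fly() { return "flapping"; }
}
```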
Resources
Java Runtime Environment
The Java Runtime Environment, or JRE, is the software required to run any application deployed on the Java Platform.
End-users commonly use a JRE in software packages and Web browser plugins.
Sun also distributes a superset of the JRE called the Java 2 SDK (more commonly known as the JDK), which includes development tools such as the Java compiler, Javadoc, Jar and debugger.
One of the unique advantages of the concept of a runtime engine is that errors (exceptions) should not 'crash' the system.
Moreover, in runtime engine environments such as Java there exist tools that attach to the runtime engine and every time that an exception of interest occurs they record debugging information that existed in memory at the time the exception was thrown (stack and heap values).
These Automated Exception Handling tools provide 'root-cause' information for exceptions in Java programs that run in production, testing or development environments.
Components
Java libraries are the compiled byte codes of source code developed by the JRE implementor to support application development in Java.
Examples of these libraries are:
The core libraries, which include:
Collection libraries that implement data structures such as lists, dictionaries, trees and sets
XML Processing (Parsing, Transforming, Validating) libraries
Security
Internationalization and localization libraries
The integration libraries, which allow the application writer to communicate with external systems.
These libraries include:
The Java Database Connectivity (JDBC) API for database access
Java Naming and Directory Interface (JNDI) for lookup and discovery
RMI and CORBA for distributed application development
User Interface libraries, which include:
The (heavyweight, or native) Abstract Windowing Toolkit (AWT), which provides GUI components, the means for laying out those components and the means for handling events from those components
The (lightweight) Swing libraries, which are built on AWT but provide (non-native) implementations of the AWT widgetry
APIs for audio capture, processing, and playback
A platform dependent implementation of Java virtual machine (JVM) that is the means by which the byte codes of the Java libraries and third party applications are executed
Plugins, which enable applets to be run in Web browsers
Java Web Start, which allows Java applications to be efficiently distributed to end users across the Internet
Licensing and documentation
APIs
Sun has defined three platforms targeting different application environments and segmented many of its APIs so that they belong to one of the platforms.
The platforms are:
Java Platform, Micro Edition (Java ME) — targeting environments with limited resources,
Java Platform, Standard Edition (Java SE) — targeting workstation environments, and
Java Platform, Enterprise Edition (Java EE) — targeting large distributed enterprise or Internet environments.
The classes in the Java APIs are organized into separate groups called packages.
Each package contains a set of related interfaces, classes and exceptions.
Refer to the separate platforms for a description of the packages available.
The set of APIs is controlled by Sun Microsystems in cooperation with others through the Java Community Process program.
Companies or individuals participating in this process can influence the design and development of the APIs.
This process has been a subject of controversy.
Language
A language is a dynamic set of visual, auditory, or tactile symbols of communication and the elements used to manipulate them.
Language can also refer to the use of such systems as a general phenomenon.
Language is considered to be an exclusively human mode of communication; although other animals make use of quite sophisticated communicative systems, none of these are known to make use of all of the properties that linguists use to define language.
Properties of language
A set of agreed-upon symbols is only one feature of language; all languages must define the structural relationships between these symbols in a system of grammar.
Rules of grammar are what distinguish language from other forms of communication.
They allow a finite set of symbols to be manipulated to create a potentially infinite number of grammatical utterances.
Another property of language is that its symbols are arbitrary.
Any concept or grammatical rule can be mapped onto a symbol.
Most languages make use of sound, but the combinations of sounds used do not have any inherent meaning – they are merely an agreed-upon convention to represent a certain thing by users of that language.
For instance, there is nothing about the Spanish word nada itself that forces Spanish speakers to use it to convey the idea of "nothing".
Another set of sounds (for example, the English word nothing) could equally be used to represent the same concept, but all Spanish speakers have acquired or learned to correlate this meaning with this particular sound pattern.
For Slovenian, Croatian, Serbian/Kosovan or Bosnian speakers, on the other hand, nada means something else; it means "hope".
The study of language
Linguistics
Linguistics is the scientific and philosophical study of language, encompassing a number of sub-fields.
At the core of theoretical linguistics are the study of language structure (grammar) and the study of meaning (semantics).
The first of these encompasses morphology (the formation and composition of words), syntax (the rules that determine how words combine into phrases and sentences) and phonology (the study of sound systems and abstract sound units).
Phonetics is a related branch of linguistics concerned with the actual properties of speech sounds (phones), non-speech sounds, and how they are produced and perceived.
Theoretical linguistics is mostly concerned with developing models of linguistic knowledge.
The fields that are generally considered as the core of theoretical linguistics are syntax, phonology, morphology, and semantics.
Applied linguistics attempts to put linguistic theories into practice through areas like translation, stylistics, literary criticism and theory, discourse analysis, speech therapy, speech pathology and foreign language teaching.
History
The historical record of linguistics begins in India with Pāṇini, the 5th century BCE grammarian who formulated 3,959 rules of Sanskrit morphology, known as the Aṣṭādhyāyī (अष्टाध्यायी), and with Tolkāppiyar, the 3rd century BCE grammarian of the Tamil work Tolkāppiyam.
Pāṇini's grammar is highly systematized and technical.
Inherent in its analytic approach are the concepts of the phoneme, the morpheme, and the root; Western linguists only recognized the phoneme some two millennia later.
Tolkāppiyar's work is perhaps the first to describe articulatory phonetics for a language.
Its classification of the alphabet into consonants and vowels, and its grouping of elements such as nouns and verbs into classes, were also breakthroughs at the time.
In the Middle East, the Persian linguist Sibawayh (سیبویه) made a detailed and professional description of Arabic in 760 CE in his monumental work, Al-kitab fi al-nahw (الكتاب في النحو, The Book on Grammar), bringing many linguistic aspects of language to light.
In his book, he distinguished phonetics from phonology.
Later in the West, the success of science, mathematics, and other formal systems in the 20th century led many to attempt a formalization of the study of language as a "semantic code".
This resulted in the academic discipline of linguistics, the founding of which is attributed to Ferdinand de Saussure.
In the 20th century, substantial contributions to the understanding of language came from Ferdinand de Saussure, Hjelmslev, Émile Benveniste and Roman Jakobson, which are characterized as being highly systematic.
Human languages
Human languages are usually referred to as natural languages, and the science of studying them falls under the purview of linguistics.
A common progression for natural languages is that they are considered to be first spoken, then written, and then an understanding and explanation of their grammar is attempted.
Languages live, die, move from place to place, and change with time.
Any language that ceases to change or develop is categorized as a dead language.
Conversely, a living language, that is, one in a continuous state of change, is known as a modern language.
Making a principled distinction between one language and another is usually impossible.
For instance, there are a few dialects of German similar to some dialects of Dutch.
The transition between languages within the same language family is sometimes gradual (see dialect continuum).
Some like to make parallels with biology, where it is not possible to make a well-defined distinction between one species and the next.
In either case, the ultimate difficulty may stem from the interactions between languages and populations.
(See Dialect or August Schleicher for a longer discussion.)
The concepts of Ausbausprache, Abstandsprache and Dachsprache are used to make finer distinctions about the degrees of difference between languages or dialects.
Artificial languages
Constructed languages
Some individuals and groups have constructed their own artificial languages, for practical, experimental, personal, or ideological reasons.
International auxiliary languages are generally constructed languages that strive to be easier to learn than natural languages; other constructed languages strive to be more logical ("loglangs") than natural languages; a prominent example of this is Lojban.
Some writers, such as J. R. R. Tolkien, have created fantasy languages, for literary, artistic or personal reasons.
The fantasy language of the Klingon race has in recent years been developed by fans of the Star Trek series, including a vocabulary and grammar.
Constructed languages are not necessarily restricted to the properties shared by natural languages.
This part of ISO 639 also includes identifiers that denote constructed (or artificial) languages.
In order to qualify for inclusion the language must have a literature and it must be designed for the purpose of human communication.
Specifically excluded are reconstructed languages and computer programming languages.
International auxiliary languages
Some languages, most constructed, are meant specifically for communication between people of different nationalities or language groups as an easy-to-learn second language.
Several of these languages have been constructed by individuals or groups.
Natural, pre-existing languages may also be used in this way - their developers merely catalogued and standardized their vocabulary and identified their grammatical rules.
These languages are called naturalistic.
One such language, Latino Sine Flexione, is a simplified form of Latin.
Two others, Occidental and Novial, were drawn from several Western languages.
To date, the most successful auxiliary language is Esperanto, invented by Polish ophthalmologist Zamenhof.
It has a relatively large community, roughly estimated at about 2 million speakers worldwide, and a large body of literature and songs; it is the only known constructed language to have native speakers, such as the Hungarian-born American businessman George Soros.
Other auxiliary languages with a relatively large number of speakers and literature are Interlingua and Ido.
Controlled languages
Controlled natural languages are subsets of natural languages whose grammars and dictionaries have been restricted in order to reduce or eliminate both ambiguity and complexity.
The purpose behind the development and implementation of a controlled natural language typically is to aid non-native speakers of a natural language in understanding it, or to ease computer processing of a natural language.
An example of a widely used controlled natural language is Simplified English, which was originally developed for aerospace industry maintenance manuals.
Formal languages
Mathematics and computer science use artificial entities called formal languages (including programming languages and markup languages, and some that are more theoretical in nature).
These often take the form of character strings, produced by a combination of formal grammar and semantics of arbitrary complexity.
Programming languages
A programming language is an extreme case of a formal language that can be used to control the behavior of a machine, particularly a computer, to perform specific tasks.
Programming languages are defined using syntactic and semantic rules, to determine structure and meaning respectively.
Programming languages are used to facilitate communication about the task of organizing and manipulating information, and to express algorithms precisely.
Some authors restrict the term "programming language" to those languages that can express all possible algorithms; sometimes the term "computer language" is used for artificial languages that are more limited.
Animal communication
The term "animal languages" is often used for non-human languages.
Linguists do not consider these to be "language", but describe them as animal communication, because the interaction between animals in such communication is fundamentally different in its underlying principles from human language.
Nevertheless, some scholars have tried to disprove this mainstream premise through experiments on training chimpanzees to talk.
Karl von Frisch received the Nobel Prize in 1973 for his proof of the language and dialects of the bees.
In several publicized instances, non-human animals have been taught to understand certain features of human language.
Chimpanzees, gorillas, and orangutans have been taught hand signs based on American Sign Language.
The African Grey Parrot, which possesses the ability to mimic human speech with a high degree of accuracy, is suspected of having sufficient intelligence to comprehend some of the speech it mimics.
Most species of parrot, despite expert mimicry, are believed to have no linguistic comprehension at all.
While proponents of animal communication systems have debated levels of semantics, these systems have not been found to have anything approaching human language syntax.
Language model
A statistical language model assigns a probability to a sequence of m words by means of a probability distribution.
Language modeling is used in many natural language processing applications such as speech recognition, machine translation, part-of-speech tagging, parsing and information retrieval.
In speech recognition and in data compression, such a model tries to capture the properties of a language, and to predict the next word in a speech sequence.
When used in information retrieval, a language model is associated with a document in a collection.
With query $Q$ as input, retrieved documents are ranked based on the probability that the document's language model $M_d$ would generate the terms of the query, $P(Q \mid M_d)$.
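The ranking step described above can be sketched as follows. This is a minimal illustration, assuming a unigram model for each document with add-one smoothing over a shared vocabulary; the toy collection, the function names and the smoothing choice are assumptions made for the example and are not taken from the text.

```python
import math
from collections import Counter

def query_log_likelihood(query, doc, vocab_size):
    """Return log P(Q | M_d) under a unigram model with add-one smoothing."""
    counts = Counter(doc)
    total = len(doc)
    # Summing log-probabilities avoids underflow from multiplying small numbers.
    return sum(
        math.log((counts[term] + 1) / (total + vocab_size))
        for term in query
    )

def rank_documents(query, docs):
    """Return document ids ordered by decreasing query likelihood."""
    vocab = {word for words in docs.values() for word in words}
    scores = {
        doc_id: query_log_likelihood(query, words, len(vocab))
        for doc_id, words in docs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical toy collection.
docs = {
    "d1": "the cat sat on the mat".split(),
    "d2": "the dog chased the cat".split(),
    "d3": "stock markets fell sharply".split(),
}
ranking = rank_documents(["cat", "mat"], docs)
```

Here the document containing both query terms ranks first, the one containing only "cat" second, and the unrelated document last.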
Estimating the probability of sequences can become difficult in corpora, in which phrases or sentences can be arbitrarily long and hence some sequences are not observed during training of the language model (the data sparseness problem).
For that reason these models are often approximated using smoothed N-gram models.
N-gram models
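In an n-gram model, the probability $P(w_1, \ldots, w_m)$ of observing the sentence $w_1, \ldots, w_m$ is approximated as
$$P(w_1, \ldots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \ldots, w_{i-1}) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-(n-1)}, \ldots, w_{i-1})$$
Here, it is assumed that the probability of observing the $i$th word $w_i$ in the context history of the preceding $i-1$ words can be approximated by the probability of observing it in the shortened context history of the preceding $n-1$ words (a Markov property of order $n-1$).
The conditional probability can be calculated from n-gram frequency counts:
$$P(w_i \mid w_{i-(n-1)}, \ldots, w_{i-1}) = \frac{\mathrm{count}(w_{i-(n-1)}, \ldots, w_{i-1}, w_i)}{\mathrm{count}(w_{i-(n-1)}, \ldots, w_{i-1})}$$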
The words bigram and trigram language model denote n-gram language models with n=2 and n=3, respectively.
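The count-based estimate of the conditional probability can be sketched as follows. This is an unsmoothed illustration under stated assumptions: the toy corpus and the function names are hypothetical, and unseen contexts are simply given probability zero, which a practical model would avoid by smoothing.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def conditional_probability(tokens, context, word):
    """Estimate P(word | context) as count(context + word) / count(context).

    The context must be non-empty, so this covers n-gram models with n >= 2.
    """
    context = tuple(context)
    n = len(context) + 1
    joint = ngram_counts(tokens, n)[context + (word,)]
    prior = ngram_counts(tokens, n - 1)[context]
    return joint / prior if prior else 0.0

# Hypothetical toy corpus.
corpus = "I saw the red house and I saw the dog".split()
p_bigram = conditional_probability(corpus, ["the"], "red")
p_trigram = conditional_probability(corpus, ["saw", "the"], "dog")
```

In this corpus "the" occurs twice and is followed by "red" once, so the bigram estimate is 0.5; likewise "saw the" occurs twice and is followed by "dog" once.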
Example
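In a bigram (n=2) language model, the probability of the sentence "I saw the red house" is approximated as
$$P(\text{I, saw, the, red, house}) \approx P(\text{I})\, P(\text{saw} \mid \text{I})\, P(\text{the} \mid \text{saw})\, P(\text{red} \mid \text{the})\, P(\text{house} \mid \text{red})$$
whereas in a trigram (n=3) language model, the approximation is
$$P(\text{I, saw, the, red, house}) \approx P(\text{I})\, P(\text{saw} \mid \text{I})\, P(\text{the} \mid \text{I, saw})\, P(\text{red} \mid \text{saw, the})\, P(\text{house} \mid \text{the, red})$$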
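The way a sentence is split into conditional factors for a given n can be sketched as follows. The function names are hypothetical, and the sketch assumes that the first words of the sentence simply use whatever shorter history is available rather than a sentence-start marker.

```python
def ngram_factors(sentence, n):
    """List the (context, word) pairs whose probabilities are multiplied."""
    words = sentence.split()
    factors = []
    for i, word in enumerate(words):
        # Keep at most the n-1 preceding words as context.
        context = words[max(0, i - (n - 1)):i]
        factors.append((tuple(context), word))
    return factors

def format_factors(sentence, n):
    """Render the factors as a product of conditional probabilities."""
    parts = []
    for context, word in ngram_factors(sentence, n):
        if context:
            parts.append(f"P({word} | {', '.join(context)})")
        else:
            parts.append(f"P({word})")
    return " ".join(parts)

bigram = format_factors("I saw the red house", 2)
trigram = format_factors("I saw the red house", 3)
```

Raising n lengthens each context by one word, which makes each factor more specific but also harder to estimate from limited data.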
Latent semantic analysis
Latent semantic analysis (LSA) is a technique in natural language processing, in particular in vectorial semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms.
LSA was patented in 1988 (US Patent 4,839,853) by Scott Deerwester, Susan Dumais, George Furnas, Richard Harshman, Thomas Landauer, Karen Lochbaum and Lynn Streeter.
In the context of its application to information retrieval, it is sometimes called latent semantic indexing (LSI).
Occurrence matrix
LSA can use a term-document matrix which describes the occurrences of terms in documents; it is a sparse matrix whose rows correspond to terms (typically stemmed words that appear in the documents) and whose columns correspond to documents.
A typical example of the weighting of the elements of the matrix is tf-idf (term frequency–inverse document frequency): each element is proportional to the number of times the term appears in the document, with rare terms upweighted to reflect their relative importance.
This matrix is also common to standard semantic models, though it is not necessarily explicitly expressed as a matrix, since the mathematical properties of matrices are not always used.
LSA transforms the occurrence matrix into a relation between the terms and some concepts, and a relation between those concepts and the documents.
Thus the terms and documents are now indirectly related through the concepts.
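A tf-idf weighted term-document matrix of the kind described in this section can be sketched as follows. The sketch assumes one common variant of the weighting (raw term count multiplied by the logarithm of the inverse document frequency); the toy documents and the function name are hypothetical, and stemming is omitted.

```python
import math

def tfidf_matrix(docs):
    """Return (terms, matrix) with one row per term and one column per document."""
    terms = sorted({word for doc in docs for word in doc})
    n_docs = len(docs)
    matrix = []
    for term in terms:
        # Document frequency: number of documents containing the term.
        df = sum(1 for doc in docs if term in doc)
        idf = math.log(n_docs / df)
        # Raw term frequency in each document, scaled by the idf weight.
        matrix.append([doc.count(term) * idf for doc in docs])
    return terms, matrix

# Hypothetical toy documents, already tokenized.
docs = [
    "car truck car".split(),
    "truck road".split(),
    "flower garden flower".split(),
]
terms, matrix = tfidf_matrix(docs)
```

With this variant a term that occurs in every document receives weight zero, and most entries are zero because each document contains only a few of the terms.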
Applications
The new concept space typically can be used to:
Compare the documents in the concept space (data clustering, document classification).
Find similar documents across languages, after analyzing a base set of translated documents (cross language retrieval).
Find relations between terms (synonymy and polysemy).
Given a query of terms, translate it into the concept space, and find matching documents (information retrieval).
Synonymy and polysemy are fundamental problems in natural language processing:
Synonymy is the phenomenon where different words describe the same idea.
Thus, a query in a search engine may fail to retrieve a relevant document that does not contain the words which appeared in the query.
For example, a search for "doctors" may not return a document containing the word "physicians", even though the words have the same meaning.
Polysemy is the phenomenon where the same word has multiple meanings.
So a search may retrieve irrelevant documents containing the desired words in the wrong meaning.
For example, a botanist and a computer scientist looking for the word "tree" probably desire different sets of documents.
Rank lowering
After the construction of the occurrence matrix, LSA finds a low-rank approximation to the term-document matrix.
There could be various reasons for these approximations:
The original term-document matrix is presumed too large for the computing resources; in this case, the approximated low rank matrix is interpreted as an approximation (a "least and necessary evil").
The original term-document matrix is presumed noisy: for example, anecdotal instances of terms are to be eliminated.
From this point of view, the approximated matrix is interpreted as a de-noised matrix (a better matrix than the original).
The original term-document matrix is presumed overly sparse relative to the "true" term-document matrix.
That is, the original matrix lists only the words actually in each document, whereas we might be interested in all words related to each document, generally a much larger set because of synonymy.
The consequence of the rank lowering is that some dimensions are combined and depend on more than one term:
{(car), (truck), (flower)} --> {(1.3452 * car + 0.2828 * truck), (flower)}
This mitigates synonymy, as the rank lowering is expected to merge the dimensions associated with terms that have similar meanings.
It also mitigates polysemy, since components of polysemous words that point in the "right" direction are added to the components of words that share a similar meaning.
Conversely, components that point in other directions tend to either simply cancel out, or, at worst, to be smaller than components in the directions corresponding to the intended sense.
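The effect of rank lowering can be sketched with a truncated singular value decomposition. This is a minimal illustration, assuming NumPy is available; the small count matrix and the function name are hypothetical and chosen so that "car" and "truck" co-occur while "flower" does not.

```python
import numpy as np

def low_rank_approximation(X, k):
    """Return the rank-k approximation of X built from its k largest singular values."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # np.linalg.svd returns the singular values in descending order.
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Hypothetical term-document counts: rows are car, truck, flower.
X = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 1.0],
])
X2 = low_rank_approximation(X, 2)
error = np.linalg.norm(X - X2, "fro")
```

In this example the rows for "car" and "truck" become identical in the rank-2 matrix, which is the merging of related dimensions described above, while the "flower" row is left unchanged.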
Derivation
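Let $X$ be a matrix where element $(i,j)$ describes the occurrence of term $i$ in document $j$ (this can be, for example, the frequency).
$X$ will look like this:
$$X = \begin{bmatrix} x_{1,1} & \cdots & x_{1,n} \\ \vdots & \ddots & \vdots \\ x_{m,1} & \cdots & x_{m,n} \end{bmatrix}$$
Now a row in this matrix will be a vector corresponding to a term, giving its relation to each document:
$$\mathbf{t}_i^T = \begin{bmatrix} x_{i,1} & \cdots & x_{i,n} \end{bmatrix}$$
Likewise, a column in this matrix will be a vector corresponding to a document, giving its relation to each term:
$$\mathbf{d}_j = \begin{bmatrix} x_{1,j} \\ \vdots \\ x_{m,j} \end{bmatrix}$$
Now the dot product $\mathbf{t}_i^T \mathbf{t}_p$ between two term vectors gives the correlation between the terms over the documents.
The matrix product $X X^T$ contains all these dot products.
Element $(i,p)$ (which is equal to element $(p,i)$) contains the dot product $\mathbf{t}_i^T \mathbf{t}_p$ ($= \mathbf{t}_p^T \mathbf{t}_i$).
Likewise, the matrix $X^T X$ contains the dot products between all the document vectors, giving their correlation over the terms: $\mathbf{d}_j^T \mathbf{d}_q = \mathbf{d}_q^T \mathbf{d}_j$.
Now assume that there exists a decomposition of $X$ such that $U$ and $V$ are orthonormal matrices and $\Sigma$ is a diagonal matrix.
This is called a singular value decomposition (SVD):
$$X = U \Sigma V^T$$
The matrix products giving us the term and document correlations then become
$$X X^T = (U \Sigma V^T)(U \Sigma V^T)^T = U \Sigma V^T V \Sigma^T U^T = U \Sigma \Sigma^T U^T$$
$$X^T X = (U \Sigma V^T)^T (U \Sigma V^T) = V \Sigma^T U^T U \Sigma V^T = V \Sigma^T \Sigma V^T$$
Since $\Sigma \Sigma^T$ and $\Sigma^T \Sigma$ are diagonal we see that $U$ must contain the eigenvectors of $X X^T$, while $V$ must be the eigenvectors of $X^T X$.
Both products have the same non-zero eigenvalues, given by the non-zero entries of $\Sigma \Sigma^T$, or equally, by the non-zero entries of $\Sigma^T \Sigma$.
Now the decomposition looks like this:
$$X = \begin{bmatrix} \mathbf{u}_1 & \cdots & \mathbf{u}_l \end{bmatrix} \begin{bmatrix} \sigma_1 & & \\ & \ddots & \\ & & \sigma_l \end{bmatrix} \begin{bmatrix} \mathbf{v}_1^T \\ \vdots \\ \mathbf{v}_l^T \end{bmatrix}$$
The values $\sigma_1, \ldots, \sigma_l$ are called the singular values, and $\mathbf{u}_1, \ldots, \mathbf{u}_l$ and $\mathbf{v}_1, \ldots, \mathbf{v}_l$ the left and right singular vectors.
Notice how the only part of $U$ that contributes to $\mathbf{t}_i$ is the $i$th row.
Let this row vector be called $\hat{\mathbf{t}}_i^T$.
Likewise, the only part of $V^T$ that contributes to $\mathbf{d}_j$ is the $j$th column, $\hat{\mathbf{d}}_j$.
These are not the eigenvectors, but depend on all the eigenvectors.
It turns out that when you select the $k$ largest singular values, and their corresponding singular vectors from $U$ and $V$, you get the rank $k$ approximation to $X$ with the smallest error (Frobenius norm).
The amazing thing about this approximation is that not only does it have a minimal error, but it translates the term and document vectors into a concept space.
The vector $\hat{\mathbf{t}}_i$ then has $k$ entries, each giving the occurrence of term $i$ in one of the $k$ concepts.
Likewise, the vector $\hat{\mathbf{d}}_j$ gives the relation between document $j$ and each concept.
We write this approximation as
$$X_k = U_k \Sigma_k V_k^T$$
You can now do the following:
See how related documents $j$ and $q$ are in the concept space by comparing the vectors $\hat{\mathbf{d}}_j$ and $\hat{\mathbf{d}}_q$ (typically by cosine similarity).
This gives you a clustering of the documents.
Comparing terms $i$ and $p$ by comparing the vectors $\hat{\mathbf{t}}_i$ and $\hat{\mathbf{t}}_p$, giving you a clustering of the terms in the concept space.
Given a query, view this as a mini document, and compare it to your documents in the concept space.
To do the latter, you must first translate your query into the concept space.
It is then intuitive that you must use the same transformation that you use on your documents:
$$\mathbf{d}_j = U_k \Sigma_k \hat{\mathbf{d}}_j$$
$$\hat{\mathbf{d}}_j = \Sigma_k^{-1} U_k^T \mathbf{d}_j$$
This means that if you have a query vector $\mathbf{q}$, you must do the translation $\hat{\mathbf{q}} = \Sigma_k^{-1} U_k^T \mathbf{q}$ before you compare it with the document vectors in the concept space.
You can do the same for pseudo term vectors:
$$\mathbf{t}_i^T = \hat{\mathbf{t}}_i^T \Sigma_k V_k^T$$
$$\hat{\mathbf{t}}_i^T = \mathbf{t}_i^T V_k \Sigma_k^{-1}$$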
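Translating a query into the concept space and comparing it with the documents can be sketched as follows. This is a minimal illustration, assuming NumPy is available and that a query is mapped by multiplying with the transposed truncated left singular vectors and dividing by the retained singular values; the count matrix, the query and the function names are hypothetical.

```python
import numpy as np

def fit_lsa(X, k):
    """Truncated SVD of a term-document matrix X (rows: terms, columns: documents)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

def fold_in(q, U_k, s_k):
    """Map a term-space vector into the concept space: Sigma_k^{-1} U_k^T q."""
    return (U_k.T @ q) / s_k

def cosine(a, b):
    """Cosine similarity between two concept-space vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical counts: rows are car, truck, flower; columns are four documents.
X = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 1.0],
])
U_k, s_k, Vt_k = fit_lsa(X, 2)
doc_vectors = Vt_k.T  # row j is the concept-space vector of document j
query = fold_in(np.array([1.0, 0.0, 0.0]), U_k, s_k)  # a query containing only "car"
scores = [cosine(query, d) for d in doc_vectors]
```

Folding in one of the original document columns reproduces that document's concept-space vector, so queries and documents are compared in the same coordinates. In this example the query for "car" matches the two vehicle documents equally and has no similarity to the two flower documents.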
Implementation
The SVD is typically computed using large matrix methods (for example, Lanczos methods) but may also be computed incrementally and with greatly reduced resources via a neural network-like approach, which does not require the large, full-rank matrix to be held in memory (Gorrell and Webb, 2005).
A fast, incremental, low-memory, large-matrix SVD algorithm has recently been developed (Brand, 2006).
Unlike Gorrell and Webb's (2005) stochastic approximation, Brand's (2006) algorithm provides an exact solution.
Limitations
LSA has two drawbacks:
The resulting dimensions might be difficult to interpret.
For instance, in
{(car), (truck), (flower)} --> {(1.3452 * car + 0.2828 * truck), (flower)}
the (1.3452 * car + 0.2828 * truck) component could be interpreted as "vehicle".
However, it is very likely that cases close to
{(car), (bottle), (flower)} --> {(1.3452 * car + 0.2828 * bottle), (flower)}
will occur.
This leads to results which can be justified on the mathematical level, but have no interpretable meaning in natural language.
The probabilistic model of LSA does not match observed data: LSA assumes that words and documents form a joint Gaussian model (ergodic hypothesis), while a Poisson distribution has been observed.
Thus, a newer alternative is probabilistic latent semantic analysis, based on a multinomial model, which is reported to give better results than standard LSA.
Linguistics
Linguistics is the scientific study of language, encompassing a number of sub-fields.
An important topical division is between the study of language structure (grammar) and the study of meaning (semantics).
Grammar encompasses morphology (the formation and composition of words), syntax (the rules that determine how words combine into phrases and sentences) and phonology (the study of sound systems and abstract sound units).
Phonetics is a related branch of linguistics concerned with the actual properties of speech sounds (phones), non-speech sounds, and how they are produced and perceived.
Over the twentieth century, following the work of Noam Chomsky, linguistics came to be dominated by the Generativist school, which is chiefly concerned with explaining how human beings acquire language and the biological constraints on this acquisition; generative theory is modularist in character.
While this remains the dominant paradigm, other linguistic theories have increasingly gained in popularity — cognitive linguistics being a prominent example.
There are many sub-fields in linguistics, which may or may not be dominated by a particular theoretical approach: evolutionary linguistics, for example, attempts to account for the origins of language; historical linguistics explores language change; and sociolinguistics looks at the relation between linguistic variation and social structures.
A variety of intellectual disciplines are relevant to the study of language.
Although certain linguists have downplayed the relevance of some other fields, linguistics — like other sciences — is highly interdisciplinary and draws on work from such fields as psychology, informatics, computer science, philosophy, biology, human anatomy, neuroscience, sociology, anthropology, and acoustics.
Names for the discipline
Before the twentieth century, the term "philology" (first attested in 1716) was commonly used to refer to the science of language, which was then predominantly historical in focus.
Since Ferdinand de Saussure's insistence on the importance of synchronic analysis, however, this focus has shifted and the term "philology" is now generally used for the "study of a language's grammar, history and literary tradition", especially in the USA, where it was never as popular as elsewhere in the sense "science of language".
The term "linguistics" dates from 1847, although "linguist" in the sense "a student of language" dates from 1641.
It is now the usual academic term in English for the scientific study of language.
Fundamental concerns and divisions
Linguistics concerns itself with describing and explaining the nature of human language.
Relevant to this are the questions of what is universal to language, how language can vary, and how human beings come to know languages.
All humans (setting aside extremely pathological cases) achieve competence in whatever language is spoken (or signed, in the case of signed languages) around them when growing up, with apparently little need for explicit conscious instruction.
While non-humans acquire their own communication systems, they do not acquire human language in this way (although many non-human animals can learn to respond to language, or can even be trained to use it to a degree).
Therefore, linguists assume, the ability to acquire and use language is an innate, biologically-based potential of modern human beings, similar to the ability to walk.
There is no consensus, however, as to the extent of this innate potential, or its domain-specificity (the degree to which such innate abilities are specific to language), with some theorists claiming that there is a very large set of highly abstract and specific binary settings coded into the human brain, while others claim that the ability to learn language is a product of general human cognition.
It is, however, generally agreed that there are no strong genetic differences underlying the differences between languages: an individual will acquire whatever language(s) they are exposed to as a child, regardless of parentage or ethnic origin.
Linguistic structures are pairings of meaning and form (which may consist of sound patterns, movements of the hand, written symbols, and so on); such pairings are known as Saussurean signs.
Linguists may specialize in some sub-area of linguistic structure, which can be arranged in the following terms, from form to meaning:
Phonetics, the study of the physical properties of speech (or signed) production and perception
Phonology, the study of sounds (adjusted appropriately for signed languages) as discrete, abstract elements in the speaker's mind that distinguish meaning
Morphology, the study of internal structures of words and how they can be modified
Syntax, the study of how words combine to form grammatical sentences
Semantics, the study of the meaning of words (lexical semantics) and fixed word combinations (phraseology), and how these combine to form the meanings of sentences
Pragmatics, the study of how utterances are used (literally, figuratively, or otherwise) in communicative acts
Discourse analysis, the analysis of language use in texts (spoken, written, or signed)
Many linguists would agree that these divisions overlap considerably, and the independent significance of each of these areas is not universally acknowledged.
Regardless of any particular linguist's position, each area has core concepts that foster significant scholarly inquiry and research.
Intersecting with these domains are fields arranged around the kind of external factors that are considered.
For example:
Linguistic typology, the study of the common properties of diverse unrelated languages, properties that may, given sufficient attestation, be assumed to be innate to human language capacity.
Stylistics, the study of linguistic factors that place a discourse in context.
Developmental linguistics, the study of the development of linguistic ability in an individual, particularly the acquisition of language in childhood.
Historical linguistics or Diachronic linguistics, the study of language change.
Language geography, the study of the spatial patterns of languages.
Evolutionary linguistics, the study of the origin and subsequent development of language.
Psycholinguistics, the study of the cognitive processes and representations underlying language use.
Sociolinguistics, the study of social patterns and norms of linguistic variability.
Clinical linguistics, the application of linguistic theory to the area of Speech-Language Pathology.
Neurolinguistics, the study of the brain networks that underlie grammar and communication.
Biolinguistics, the study of natural as well as human-taught communication systems in animals compared to human language.
Computational linguistics, the study of computational implementations of linguistic structures.
Applied linguistics, the study of language-related issues applied in everyday life, notably language policies, planning, and education.
Constructed language fits under Applied linguistics.
The related discipline of semiotics investigates the relationship between signs and what they signify.
From the perspective of semiotics, language can be seen as a sign or symbol, with the world as its representation.
Variation and universality
Much modern linguistic research, particularly within the paradigm of generative grammar, has concerned itself with trying to account for differences between languages of the world.
This has worked on the assumption that if human linguistic ability is narrowly constrained by human biology, then all languages must share certain fundamental properties.
In generativist theory, the collection of fundamental properties all languages share is referred to as universal grammar (UG).
The specific characteristics of this universal grammar are a much debated topic.
Typologists and non-generativist linguists usually refer simply to language universals, or universals of language.
Similarities between languages can have a number of different origins.
In the simplest case, universal properties may be due to universal aspects of human experience.
For example, all humans experience water, and all human languages have a word for water.
Other similarities may be due to common descent: the Latin language spoken by the Ancient Romans developed into Spanish in Spain and Italian in Italy; similarities between Spanish and Italian are thus in many cases due to both being descended from Latin.
In other cases, contact between languages — particularly where many speakers are bilingual — can lead to much borrowing of structures, as well as words.
Similarity may also, of course, be due to coincidence.
English much and Spanish mucho are not descended from the same form or borrowed from one language to the other; nor is the similarity due to innate linguistic knowledge (see False cognate).
Arguments in favor of language universals have also come from documented cases of sign languages (such as Al-Sayyid Bedouin Sign Language) developing in communities of congenitally deaf people, independently of spoken language.
The properties of these sign languages conform generally to many of the properties of spoken languages.
Other known and suspected sign language isolates include Kata Kolok, Nicaraguan Sign Language, and Providence Island Sign Language.
Structures
It has been perceived that languages tend to be organized around grammatical categories such as noun and verb, nominative and accusative, or present and past, though, importantly, not exclusively so.
The grammar of a language is organized around such fundamental categories, though many languages express the relationships between words and syntax in other discrete ways (cf. some Bantu languages for noun/verb relations, ergative/absolutive systems for case relations, several Native American languages for tense/aspect relations).
In addition to making substantial use of discrete categories, language has the important property that it organizes elements into recursive structures; this allows, for example, a noun phrase to contain another noun phrase (as in “the chimpanzee’s lips”) or a clause to contain a clause (as in “I think that it’s raining”).
Though recursion in grammar was implicitly recognized much earlier (for example by Jespersen), the importance of this aspect of language became more widely appreciated after the 1957 publication of Noam Chomsky’s book “Syntactic Structures”, which presented a formal grammar of a fragment of English.
Prior to this, the most detailed descriptions of linguistic systems were of phonological or morphological systems.
Chomsky used a context-free grammar augmented with transformations.
Since then, following the trend of Chomskyan linguistics, context-free grammars have been written for substantial fragments of various languages (for example GPSG, for English), but it has been demonstrated that human languages include cross-serial dependencies, which cannot be handled adequately by context-free grammars.
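The recursive embedding of one noun phrase inside another, mentioned earlier in this section, can be sketched as a recursive data structure. This is only an illustration of the idea of recursion; the class name, its fields and the example nouns are hypothetical and do not represent any particular grammar formalism.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class NounPhrase:
    """A noun phrase that may contain another noun phrase as its possessor."""
    head: str
    possessor: Optional["NounPhrase"] = None

    def text(self) -> str:
        if self.possessor is None:
            return f"the {self.head}"
        # The possessor is itself a full noun phrase, so this call recurses.
        return f"{self.possessor.text()}'s {self.head}"

    def depth(self) -> int:
        """Number of noun phrases nested inside one another, including this one."""
        return 1 if self.possessor is None else 1 + self.possessor.depth()

inner = NounPhrase("mother", NounPhrase("chimpanzee"))
outer = NounPhrase("lips", inner)
```

Because the same rule can apply to its own output, nothing in the structure limits how deeply phrases can be nested.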
Some selected sub-fields
Diachronic linguistics
Studying languages at a particular point in time (usually the present) is "synchronic", while diachronic linguistics examines how language changes through time, sometimes over centuries.
It enjoys both a rich history and a strong theoretical foundation for the study of language change.
In universities in the United States, the historical perspective is often out of fashion.
The shift in focus to a non-historic perspective started with Saussure and became predominant with Noam Chomsky.
Explicitly historical perspectives include historical-comparative linguistics and etymology.
Contextual linguistics
Contextual linguistics may include the study of linguistics in interaction with other academic disciplines.
The interdisciplinary areas of linguistics consider how language interacts with the rest of the world.
Sociolinguistics, anthropological linguistics, and linguistic anthropology are seen as areas that bridge the gap between linguistics and society as a whole.
Psycholinguistics and neurolinguistics relate linguistics to the medical sciences.
Other cross-disciplinary areas of linguistics include evolutionary linguistics, computational linguistics and cognitive science.
Applied linguistics
Linguists are largely concerned with finding and describing the generalities and varieties both within particular languages and among all languages.
Applied linguistics takes the result of those findings and “applies” them to other areas.
Often “applied linguistics” refers to the use of linguistic research in language teaching, but results of linguistic research are used in many other areas, as well.
Today in the age of information technology, many areas of applied linguistics attempt to involve the use of computers.
Speech synthesis and speech recognition use phonetic and phonemic knowledge to provide voice interfaces to computers.
Applications of computational linguistics in machine translation, computer-assisted translation, and natural language processing are areas of applied linguistics which have come to the forefront.
Their influence has had an effect on theories of syntax and semantics, as modeling syntactic and semantic theories on computers constrains those theories.
Description and prescription
Linguistics is descriptive; linguists describe and explain features of language without making subjective judgments on whether a particular feature is "right" or "wrong".
This is analogous to practice in other sciences: a zoologist studies the animal kingdom without making subjective judgments on whether a particular animal is better or worse than another.
Prescription, on the other hand, is an attempt to promote particular linguistic usages over others, often favouring a particular dialect or "acrolect".
This may have the aim of establishing a linguistic standard, which can aid communication over large geographical areas.
It may also, however, be an attempt by speakers of one language or dialect to exert influence over speakers of other languages or dialects (see Linguistic imperialism).
An extreme version of prescriptivism can be found among censors, who attempt to eradicate words and structures which they consider to be destructive to society.
Speech and writing
Most contemporary linguists work under the assumption that spoken (or signed) language is more fundamental than written language.
This is because:
Speech appears to be a human "universal", whereas there have been many cultures and speech communities that lack written communication;
Speech evolved before human beings invented writing;
People learn to speak and process spoken languages more easily and much earlier than writing.
Linguists nonetheless agree that the study of written language can be worthwhile and valuable.
For research that relies on corpus linguistics and computational linguistics, written language is often much more convenient for processing large amounts of linguistic data.
Large corpora of spoken language are difficult to create and hard to find, and are typically transcribed and written.
Additionally, linguists have turned to text-based discourse occurring in various formats of computer-mediated communication as a viable site for linguistic inquiry.
The study of writing systems themselves is in any case considered a branch of linguistics.
History
Some of the earliest linguistic activities can be traced to Iron Age India with the analysis of Sanskrit.
The Pratishakhyas (from ca. the 8th century BC) constitute as it were a proto-linguistic ad hoc collection of observations about mutations to a given corpus particular to a given Vedic school.
Systematic study of these texts gives rise to the Vedanga discipline of Vyakarana, the earliest surviving account of which is the work of Pānini (c. 520 – 460 BC), who, however, looks back on what are probably several generations of grammarians, whose opinions he occasionally refers to.
Pānini formulates close to 4,000 rules which together form a compact generative grammar of Sanskrit.
Inherent in his analytic approach are the concepts of the phoneme, the morpheme and the root.
Due to its focus on brevity, his grammar has a highly unintuitive structure, reminiscent of contemporary "machine language" (as opposed to "human readable" programming languages).
Indian linguistics maintained a high level for several centuries; Patanjali in the 2nd century BC still actively criticizes Pānini.
In the later centuries BC, however, Pānini's grammar came to be seen as prescriptive, and commentators came to be fully dependent on it.
Bhartrihari (c. 450 – 510) theorized the act of speech as being made up of four stages: first, conceptualization of an idea; second, its verbalization and sequencing (articulation); third, delivery of speech into atmospheric air; and fourth, the interpretation of speech by the listener, the interpreter.
In the Middle East, the Persian linguist Sibawayh made a detailed and professional description of Arabic in 760, in his monumental work, Al-kitab fi al-nahw (الكتاب في النحو, The Book on Grammar), bringing many linguistic aspects of language to light.
In his book he distinguished phonetics from phonology.
Western linguistics begins in Classical Antiquity with grammatical speculation such as Plato's Cratylus.
Sir William Jones noted that Sanskrit shared many common features with classical Latin and Greek, notably verb roots and grammatical structures, such as the case system.
This led to the theory that these languages sprang from a common source and to the discovery of the Indo-European language family.
He began the study of comparative linguistics, which would uncover more language families and branches.
Some 19th-century linguists were Jakob Grimm, who devised a principle of consonantal shifts in pronunciation, known as Grimm's Law, in 1822; Karl Verner, who formulated Verner's Law; August Schleicher, who created the "Stammbaumtheorie" ("family tree"); and Johannes Schmidt, who developed the "Wellentheorie" ("wave model") in 1872.
Ferdinand de Saussure was the founder of modern structural linguistics.
Edward Sapir, a leader in American structural linguistics, was one of the first who explored the relations between language studies and anthropology.
His methodology had strong influence on all his successors.
Noam Chomsky's formal model of language, transformational-generative grammar, developed under the influence of his teacher Zellig Harris, who was in turn strongly influenced by Leonard Bloomfield, has been the dominant model since the 1960s.
Noam Chomsky remains a pop-linguistic figure.
Linguists (working in frameworks such as Head-Driven Phrase Structure Grammar (HPSG) or Lexical Functional Grammar (LFG)) are increasingly seen to stress the importance of formalization and formal rigor in linguistic description, and may distance themselves somewhat from Chomsky's more recent work (the "Minimalist" program for Transformational grammar), connecting more closely to his earlier works.
Other linguists working in Optimality Theory state generalizations in terms of violable constraints that interact with each other, and abandon the traditional rule-based formalism first pioneered by early work in generativist linguistics.
Functionalist linguists working in functional grammar and Cognitive Linguistics tend to stress the non-autonomy of linguistic knowledge and the non-universality of linguistic structures, thus differing significantly from the Chomskyan school.
They reject Chomskyan intuitive introspection as a scientific method, relying instead on typological evidence.
Linux
Linux is a Unix-like computer operating system.
Linux is one of the most prominent examples of free software and open source development: typically all underlying source code can be freely modified, used, and redistributed by anyone.
The name "Linux" comes from the Linux kernel, originally written in 1991 by Linus Torvalds.
The system's utilities and libraries usually come from the GNU operating system, announced in 1983 by Richard Stallman.
The GNU contribution is the basis for the alternative name GNU/Linux.
Predominantly known for its use in servers, Linux is supported by corporations such as Dell, Hewlett-Packard, IBM, Novell, Oracle Corporation, Red Hat, and Sun Microsystems.
It is used as an operating system for a wide variety of computer hardware, including desktop computers, supercomputers, video game systems, such as the PlayStation 2 and PlayStation 3, several arcade games, and embedded devices such as mobile phones, routers, and stage lighting systems.
History
The Unix operating system was conceived and implemented in the 1960s and first released in 1970.
Its wide availability and portability meant that it was widely adopted, copied and modified by academic institutions and businesses, with its design being influential on authors of other systems.
The GNU Project, started in 1984, had the goal of creating a "complete Unix-compatible software system" made entirely of free software.
In 1985, Richard Stallman created the Free Software Foundation and developed the GNU General Public License (GNU GPL).
Many of the programs required in an OS (such as libraries, compilers, text editors, a Unix shell, and a windowing system) were completed by the early 1990s, although low level elements such as device drivers, daemons, and the kernel were stalled and incomplete.
Linus Torvalds has said that if the GNU kernel had been available at the time (1991), he would not have decided to write his own.
MINIX
MINIX, a Unix-like system intended for academic use, was released by Andrew S. Tanenbaum in 1987.
While source code for the system was available, modification and redistribution were restricted (that is not the case today).
In addition, MINIX's 16-bit design was not well adapted to the 32-bit design of the increasingly cheap and popular Intel 386 architecture for personal computers.
In 1991, Torvalds began to work on a non-commercial replacement for MINIX while he was attending the University of Helsinki.
This eventually became the Linux kernel.
In 1992, Tanenbaum posted an article on Usenet claiming Linux was obsolete.
In the article, he criticized the operating system as being monolithic in design and tied closely to the x86 architecture, and thus not portable, which he described as "a fundamental error."
Tanenbaum suggested that those who wanted a modern operating system should look into one based on the microkernel model.
The posting elicited responses from Torvalds and Ken Thompson, one of the founders of Unix, which resulted in a well-known debate over the microkernel and monolithic kernel designs.
Linux was dependent on the MINIX user space at first.
With code from the GNU system freely available, it was advantageous if this could be used with the fledgling OS.
Code licensed under the GNU GPL can be used in other projects, so long as they also are released under the same or a compatible license.
In order to make the Linux kernel compatible with the components from the GNU Project, Torvalds initiated a switch from his original license (which prohibited commercial redistribution) to the GNU GPL.
Linux and GNU developers worked to integrate GNU components with Linux to make a fully functional and free operating system.
Commercial and popular uptake
Today Linux is used in numerous domains, from embedded systems to supercomputers, and has secured a place in server installations with the popular LAMP application stack.
Torvalds continues to direct the development of the kernel.
Stallman heads the Free Software Foundation, which in turn supports the GNU components.
Finally, individuals and corporations develop third-party non-GNU components.
These third-party components comprise a vast body of work and may include both kernel modules and user applications and libraries.
Linux vendors and communities combine and distribute the kernel, GNU components, and non-GNU components, with additional package management software in the form of Linux distributions.
Design
Linux is a modular Unix-like operating system.
It derives much of its basic design from principles established in Unix during the 1970s and 1980s.
Linux uses a monolithic kernel, the Linux kernel, which handles process control, networking, and peripheral and file system access.
Device drivers are integrated directly with the kernel.
Much of Linux's higher-level functionality is provided by separate projects which interface with the kernel.
The GNU userland is an important part of most Linux systems, providing the shell and Unix tools which carry out many basic operating system tasks.
On top of these tools, a Linux system can provide a graphical user interface, usually running in the X Window System.
User interface
Linux can be controlled through a text-based command line interface (CLI), a graphical user interface (GUI, usually the default for desktops), controls on the device itself (common on embedded machines), or a combination of these.
On desktop machines, KDE, GNOME and Xfce are the most popular user interfaces, though a variety of other user interfaces exist.
Most popular user interfaces run on top of the X Window System (X), which provides network transparency, enabling a graphical application running on one machine to be displayed and controlled from another.
Other GUIs include X window managers such as FVWM, Enlightenment and Window Maker.
The window manager provides a means to control the placement and appearance of individual application windows, and interacts with the X window system.
A Linux system usually provides a CLI of some sort through a shell, which is the traditional way of interacting with a Unix system.
A Linux distribution specialized for servers may use the CLI as its only interface.
A “headless system” run without even a monitor can be controlled by the command line via a protocol such as SSH or telnet.
Most low-level Linux components, including the GNU Userland, use the CLI exclusively.
The CLI is particularly suited for automation of repetitive or delayed tasks, and provides very simple inter-process communication.
A graphical terminal emulator program is often used to access the CLI from a Linux desktop.
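As a rough illustration of the simple inter-process communication available from the CLI, the sketch below connects two processes with a pipe, the mechanism a shell uses for a command such as "producer | consumer". It is written in Python, and the two child programs are hypothetical stand-ins invented for this example; they are launched through the running Python interpreter rather than through any particular Linux utility.

```python
import subprocess
import sys

# Hypothetical child programs, given inline so the example is self-contained.
PRODUCER = "print('alpha'); print('beta'); print('gamma')"
CONSUMER = "import sys; print(sum(1 for _ in sys.stdin))"


def count_lines_through_pipe() -> str:
    """Run the equivalent of `producer | consumer` and return its output."""
    producer = subprocess.Popen(
        [sys.executable, "-c", PRODUCER],
        stdout=subprocess.PIPE,
    )
    consumer = subprocess.Popen(
        [sys.executable, "-c", CONSUMER],
        stdin=producer.stdout,  # the consumer reads what the producer writes
        stdout=subprocess.PIPE,
        text=True,
    )
    # Close our copy of the pipe so the consumer sees end-of-file
    # as soon as the producer exits.
    producer.stdout.close()
    output, _ = consumer.communicate()
    producer.wait()
    return output.strip()


if __name__ == "__main__":
    print(count_lines_through_pipe())  # number of lines the producer wrote
```

Neither program knows about the other; each only reads standard input or writes standard output, which is what makes this style of composition easy to automate.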
Development
The primary difference between Linux and many other popular contemporary operating systems is that the Linux kernel and other components are free and open source software.
Linux is not the only such operating system, although it is the best-known and most widely used.
Some free and open source software licences are based on the principle of copyleft, a kind of reciprocity: any work derived from a copyleft piece of software must also be copyleft itself.
The most common free software license, the GNU GPL, is a form of copyleft, and is used for the Linux kernel and many of the components from the GNU project.
As an operating system underdog competing with mainstream operating systems, Linux cannot rely on a monopoly advantage; in order for Linux to be convenient for users, Linux aims for interoperability with other operating systems and established computing standards.
Linux systems adhere to POSIX, SUS, ISO and ANSI standards where possible, although to date only one Linux distribution has been POSIX.1 certified, Linux-FT.
Free software projects, although developed in a collaborative fashion, are often produced independently of each other.
However, given that the software licenses explicitly permit redistribution, this provides a basis for larger scale projects that collect the software produced by stand-alone projects and make it available all at once in the form of a Linux distribution.
A Linux distribution, commonly called a “distro”, is a project that manages a remote collection of Linux-based software, and facilitates installation of a Linux operating system.
Distributions are maintained by individuals, loose-knit teams, volunteer organizations, and commercial entities.
They include system software and application software in the form of packages, and distribution-specific software for initial system installation and configuration as well as later package upgrades and installs.
A distribution is responsible for the default configuration of installed Linux systems, system security, and more generally integration of the different software packages into a coherent whole.
Community
Linux is largely driven by its developer and user communities.
Some vendors develop and fund their distributions on a volunteer basis, Debian being a well-known example.
Others maintain a community version of their commercial distributions, as Red Hat does with Fedora.
In many cities and regions, local associations known as Linux Users Groups (LUGs) seek to promote Linux and by extension free software.
They hold meetings and provide free demonstrations, training, technical support, and operating system installation to new users.
There are also many Internet communities that seek to provide support to Linux users and developers.
Most distributions and open source projects have IRC chatrooms or newsgroups.
Online forums are another means for support, with notable examples being LinuxQuestions.org and the Gentoo forums.
Linux distributions host mailing lists; commonly there will be a specific topic such as usage or development for a given list.
There are several technology websites with a Linux focus.
Linux Weekly News is a weekly digest of Linux-related news; the Linux Journal is an online magazine of Linux articles published monthly; Slashdot is a technology-related news website with many stories on Linux and open source software; Groklaw has written in depth about Linux-related legal proceedings; and the GNU project's website hosts many articles relevant to the Linux kernel and its relationship with GNU.
Print magazines on Linux often include cover disks including software or even complete Linux distributions.
Although Linux is generally available free of charge, several large corporations have established business models that involve selling, supporting, and contributing to Linux and free software.
These include Dell, IBM, HP, Sun Microsystems, Novell, and Red Hat.
The free software licenses on which Linux is based explicitly accommodate and encourage commercialization; the relationship between Linux as a whole and individual vendors may be seen as symbiotic.
One common business model of commercial suppliers is charging for support, especially for business users.
A number of companies also offer a specialized business version of their distribution, which adds proprietary support packages and tools to administer higher numbers of installations or to simplify administrative tasks.
Another business model is to give away the software in order to sell hardware.
Programming on Linux
Most Linux distributions support dozens of programming languages.
The most common collection of utilities for building both Linux applications and operating system programs is found within the GNU toolchain, which includes the GNU Compiler Collection (GCC) and the GNU build system.
Amongst others, GCC provides compilers for Ada, C, C++, Java, and Fortran.
The Linux kernel itself is written to be compiled with GCC.
Proprietary compilers for Linux include the Intel C++ Compiler and IBM XL C/C++ Compiler.
Most distributions also include support for Perl, Ruby, Python and other dynamic languages.
Examples of languages that are less common, but still well-supported, are C# via the Mono project, sponsored by Novell, and Scheme.
A number of Java Virtual Machines and development kits run on Linux, including the original Sun Microsystems JVM (HotSpot), and IBM's J2SE RE, as well as many open-source projects like Kaffe.
The two main frameworks for developing graphical applications are those of GNOME and KDE.
These projects are based on the GTK+ and Qt widget toolkits, respectively, which can also be used independently of the larger framework.
Both support a wide variety of languages.
There are a number of integrated development environments available, including Anjuta, Code::Blocks, Eclipse, KDevelop, Lazarus, MonoDevelop, NetBeans, and Omnis Studio, while the long-established editors Vim and Emacs remain popular.
Uses
As well as those designed for general purpose use on desktops and servers, distributions may be specialized for different purposes including: computer architecture support, embedded systems, stability, security, localization to a specific region or language, targeting of specific user groups, support for real-time applications, or commitment to a given desktop environment.
Furthermore, some distributions deliberately include only free software.
Currently, over three hundred distributions are actively developed, with about a dozen distributions being most popular for general-purpose use.
Linux is a widely ported operating system.
While the Linux kernel was originally designed only for Intel 80386 microprocessors, it now runs on a more diverse range of computer architectures than any other operating system: in the hand-held ARM-based iPAQ and the mainframe IBM System z9, in devices ranging from mobile phones to supercomputers.
Specialized distributions exist for less mainstream architectures.
The ELKS kernel fork can run on Intel 8086 or Intel 80286 16-bit microprocessors, while the µClinux kernel fork may run on systems without a memory management unit.
The kernel also runs on architectures that were only ever intended to use a manufacturer-created operating system, such as Macintosh computers, PDAs, video game consoles, portable music players, and mobile phones.
Desktop
Although there is a lack of Linux ports for some Mac OS X and Microsoft Windows programs in domains such as desktop publishing and professional audio, applications equivalent to those available for Mac and Windows are available for Linux.
Most Linux distributions provide a program for browsing a list of thousands of free software applications that have already been tested and configured for a specific distribution.
These free programs can be downloaded and installed with one mouse click, and a digital signature is used to verify that no one has added a virus or spyware to them.
Many free software titles that are popular on Windows, such as Pidgin, Mozilla Firefox, OpenOffice.org, and GIMP, are available for Linux.
A growing amount of proprietary desktop software is also supported under Linux, examples being Adobe Flash Player, Acrobat Reader, Matlab, Nero Burning ROM, Opera, RealPlayer, and Skype.
In the field of animation and visual effects, most high-end software, such as Autodesk Maya, Softimage XSI and Apple Shake, is available for Linux, Windows and/or Mac OS X.
CrossOver is a proprietary solution based on the open source Wine project that supports running older Windows versions of Microsoft Office and Adobe Photoshop versions through CS2.
Microsoft Office 2007 and Adobe Photoshop CS3 are known not to work.
Besides the free Windows compatibility layer Wine, most distributions offer dual booting and x86 virtualization for running both Linux and Windows on the same computer.
Linux's open nature allows distributed teams to localize Linux distributions for use in locales where localizing proprietary systems would not be cost-effective.
For example, the Sinhalese language version of the Knoppix distribution was available for a long time before Microsoft Windows XP was translated to Sinhalese.
In this case the Lanka Linux User Group played a major part in developing the localized system by combining the knowledge of university professors, linguists, and local developers.
The performance of Linux on the desktop has been a controversial topic, with at least one key Linux kernel developer, Con Kolivas, accusing the Linux community of favouring performance on servers.
He quit Linux development because he was frustrated with this lack of focus on the desktop, and then gave a 'tell all' interview on the topic.
Servers and supercomputers
Historically, Linux has mainly been used as a server operating system, and has risen to prominence in that area; Netcraft reported in September 2006 that eight of the ten most reliable internet hosting companies run Linux on their web servers.
This is due to its relative stability and long uptime, and the fact that desktop software with a graphical user interface for servers is often unneeded.
Enterprise and non-enterprise Linux distributions may be found running on servers.
Linux is the cornerstone of the LAMP server-software combination (Linux, Apache, MySQL, Perl/PHP/Python) which has achieved popularity among developers, and which is one of the more common platforms for website hosting.
Linux is commonly used as an operating system for supercomputers.
As of November 2007, out of the top 500 systems, 426 (85.2%) run Linux.
Embedded devices
Due to its low cost and ability to be easily modified, an embedded Linux is often used in embedded systems.
Linux has become a major competitor to the proprietary Symbian OS found in the majority of smartphones — 16.7% of smartphones sold worldwide during 2006 were using Linux — and it is an alternative to the proprietary Windows CE and Palm OS operating systems on mobile devices.
Cell phones and PDAs running Linux and built on open source platforms became a trend from 2007, with examples such as the Nokia N810, Openmoko's Neo1973 and the ongoing Google Android.
The popular TiVo digital video recorder uses a customized version of Linux.
Several network firewall and router standalone products, including several from Linksys, use Linux internally, using its advanced firewall and routing capabilities.
The Korg OASYS and the Yamaha Motif XS music workstations also run Linux.
Furthermore, Linux is used in the leading stage lighting control system, the FlyingPig/HighEnd WholeHogIII Console.
Market share and uptake
Many quantitative studies of open source software focus on topics including market share and reliability, with numerous studies specifically examining Linux.
The Linux market is growing rapidly, and the revenue of servers, desktops, and packaged software running Linux is expected to exceed $35.7 billion by 2008.
IDC's report for Q1 2007 says that Linux now holds 12.7% of the overall server market.
This estimate was based on the number of Linux servers sold by various companies.
Desktop adoption of Linux is approximately 1%.
In comparison, Microsoft operating systems hold more than 90%.
The frictional cost of switching operating systems and lack of support for certain hardware and application programs designed for Microsoft Windows have been two factors that have inhibited adoption.
Proponents and analysts attribute the relative success of Linux to its security, reliability, low cost, and freedom from vendor lock-in.
More recently, Google has begun to fund Wine, which acts as a compatibility layer, allowing users to run some Windows programs under Linux.
The XO laptop project of One Laptop Per Child is creating a new and potentially much larger Linux community, planned to reach several hundred million schoolchildren and their families and communities in developing countries.
Six countries have ordered a million or more units each for delivery in 2007 to distribute to schoolchildren at no charge.
Google, Red Hat, and eBay are major supporters of the project.
Copyright and naming
The Linux kernel and most GNU software are licensed under the GNU General Public License (GPL).
The GPL requires that anyone who distributes the Linux kernel must make the source code (and any modifications) available to the recipient under the same terms.
In 1997, Linus Torvalds stated, “Making Linux GPL'd was definitely the best thing I ever did.”
Other key components of a Linux system may use other licenses; many libraries use the GNU Lesser General Public License (LGPL), a more permissive variant of the GPL, and the X Window System uses the MIT License.
Torvalds has publicly stated that he would not move the Linux kernel (currently licensed under GPL version 2) to version 3 of the GPL, released in mid-2007, specifically citing some provisions in the new license which prohibit the use of the software in digital rights management.
A 2001 study of Red Hat Linux 7.1 found that this distribution contained 30 million source lines of code.
Using the Constructive Cost Model, the study estimated that this distribution required about eight thousand man-years of development time.
According to the study, if all this software had been developed by conventional proprietary means, it would have cost about 1.08 billion dollars (year 2000 U.S. dollars) to develop in the United States.
Most of the code (71%) was written in the C programming language, but many other languages were used, including C++, assembly language, Perl, Python, Fortran, and various shell scripting languages.
Slightly over half of all lines of code were licensed under the GPL.
The Linux kernel itself was 2.4 million lines of code, or 8% of the total.
In a later study, the same analysis was performed for Debian GNU/Linux version 4.0.
This distribution contained over 283 million source lines of code, and the study estimated that it would have cost 5.4 billion Euros to develop by conventional means.
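The kind of estimate described above can be sketched with the Basic COCOMO effort equation in its "organic" mode, effort = 2.4 × KSLOC^1.05 person-months. The text does not say which coefficients the studies used or how they grouped the code into packages, so the defaults below are only an illustrative assumption, and the function names are invented for this example.

```python
def basic_cocomo_effort(sloc: float, a: float = 2.4, b: float = 1.05) -> float:
    """Return estimated effort in person-months for `sloc` source lines.

    The defaults are the textbook Basic COCOMO "organic mode" coefficients;
    they are an assumption, not necessarily those used by the studies.
    """
    ksloc = sloc / 1000.0  # the model works in thousands of lines
    return a * ksloc ** b


def person_years(person_months: float) -> float:
    """Convert person-months to person-years."""
    return person_months / 12.0


if __name__ == "__main__":
    total_sloc = 30_000_000  # size reported for Red Hat Linux 7.1
    as_one_project = basic_cocomo_effort(total_sloc)
    # Hypothetical split into 1,000 equal packages, each estimated separately.
    as_packages = 1000 * basic_cocomo_effort(total_sloc / 1000)
    print(f"one project:    {person_years(as_one_project):,.0f} person-years")
    print(f"1,000 packages: {person_years(as_packages):,.0f} person-years")
```

Because the exponent is greater than 1, treating the code as one large project gives a higher figure than summing separate estimates for smaller packages, so the total depends heavily on how the code is grouped.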
In the United States, the name Linux is a trademark registered to Linus Torvalds.
Initially, nobody registered it, but on August 15, 1994, William R. Della Croce, Jr. filed for the trademark Linux, and then demanded royalties from Linux distributors.
In 1996, Torvalds and some affected organizations sued him to have the trademark assigned to Torvalds, and in 1997 the case was settled.
The licensing of the trademark has since been handled by the Linux Mark Institute.
Torvalds has stated that he only trademarked the name to prevent someone else from using it, but was bound in 2005 by United States trademark law to take active measures to enforce the trademark.
As a result, the LMI sent out a number of letters to distribution vendors requesting that a fee be paid for the use of the name, and a number of companies have complied.
GNU/Linux
The Free Software Foundation views Linux distributions which use GNU software as GNU variants and asks that such operating systems be referred to as GNU/Linux or a Linux-based GNU system.
However, the media and the population at large refer to this family of operating systems simply as Linux.
While some distributors make a point of using the aggregate form, most notably Debian with the Debian GNU/Linux distribution, the term's use outside of the enthusiast community is limited.
The distinction between the Linux kernel and distributions based on it plus the GNU system is a source of confusion to many newcomers, and the naming remains controversial, as many large Linux distributions (e.g. Ubuntu and SuSE Linux) are simply using the Linux name, rather than GNU/Linux.
List of chatterbots
Chatterbot Directories
Chatterbot Central at The Simon Laven Page
The Chatterbot Collection
AI Hub - A directory of news, programs, and links all related to chatterbots and Artificial Intelligence
The Chatterbox Challenge Bots Directory at The Chatterbox Challenge
Classic Chatterbots
Dr. Sbaitso
ELIZA
PARRY
Racter
General Chatterbots
A.L.I.C.E. and other Alicebot/pandorabot-based (iGod, Mitsuku, FriendBot, etc.)
Albert One
ALIMbot
CHAT and TIPS
Chat-bot
Claude
Dadorac
DAI2 - A dynamic artificial intelligence which learns from its surrounding community
Elbot
Ella
Fred
Jabberwacky
Jabberwock
Jeeney AI
JIxperts – collection of wiki chatterbots.
KAR Intelligent Computer
Kyle – A unique learning Artificial Intelligence chatbot, which employs contextual learning algorithms.
MegaHal
Mr Know-It-All
Oliverbot
Poseidon
RoboMatic X1 - A chatbot which controls the user's PC through chatting by their voice or by typing.
Splotchy
Spookitalk - A chatterbot used for NPCs in Douglas Adams' Starship Titanic video game.
Thomas
Ultra Hal Assistant
Verbot
Yhaken
ScientioBot - A new technology chatterbot using concept mining techniques accessible via a free web service.
NICOLE - A simple chatterbot with the ability to learn new phrases.
IM Chatterbots
DAI2 is also available on the MSN / Windows Live network as dai2@dai2.co.uk
MSN Quickbot
SmarterChild
Spleak
MrMovie - searches for actors, movies and DVDs in IM (Skype, AOL/AIM or MSN/Live)
InsideMessenger
Inocu - (MSN/Live)
FriendBot - An AIM chatterbot
amsnEliza plugin for aMSN
TrixieMouse
Infobot - Polish informational bot for Gadu-gadu, Skype and Jabber
AIML Chatterbots
Alan - In Turing Enigma Alan Turing's spirit has infiltrated the World War II encrypting device Enigma.
Deeb0t
Chomsky - A chatbot that uses a smiley face to convey emotions.
It uses the information in Wikipedia to build its conversations and has links to Wikipedia articles.
John Lennon Artificial Intelligence Project
SitePal
JFred Chatterbots
The Turing Hub
Educational Chatterbots
Elizabeth - Aims to teach AI techniques and concepts, starting from chatterbot design.
Accompanied by self-teaching materials, as used at the University of Leeds.
Non-English Chatterbots
Amanda - (French) with source code for Windows.
Proteus
Senhor Bot - (Brazilian) bot for MSN, available as senhorbot@hotmail.com
Loebner prize
The Loebner Prize is an annual competition that awards prizes to the Chatterbot considered by the judges to be the most humanlike of those entered.
The format of the competition is that of a standard Turing test.
In the Loebner Prize, as in a Turing test, a human judge is faced with two computer screens.
One is under the control of a computer, the other is under the control of a human.
The judge poses questions to the two screens and receives answers.
Based upon the answers, the judge must decide which screen is controlled by the human and which is controlled by the computer program.
The contest was begun in 1990 by Hugh Loebner in conjunction with the Cambridge Center for Behavioral Studies of Massachusetts, United States.
It has since been associated with Flinders University, Dartmouth College, the Science Museum in London, and most recently the University of Reading.
Within the field of artificial intelligence, the Loebner Prize is somewhat controversial; the most prominent critic, Marvin Minsky, has called it a publicity stunt that does not help the field along.
Prizes
The prizes for each year include:
$2,000 for the most human-seeming of all chatterbots for that year - awarded every year.
In 2005, the prize was increased to $3,000, and the prize was $2,250 in 2006.
In 2008, the prize will be $3,000.
$25,000 for the first chatterbot that judges cannot distinguish from a real human in a text-only Turing test, and that can convince judges that the other (human) entity they are talking to simultaneously is a computer.
(to be awarded once only)
$100,000 to the first chatterbot that judges cannot distinguish from a real human in a Turing test that includes deciphering and understanding text, visual, and auditory input.
(to be awarded once only)
The Loebner Prize dissolves once the $100,000 prize is won.
2008 Loebner Prize
The 2008 Competition is to be held on Sunday, 12 October at the University of Reading, UK.
The event, which is being co-directed by Kevin Warwick, will include a direct challenge on the Turing test as originally proposed by Alan Turing.
The first place winner will receive $3000.00 and a bronze medal.
2007 Loebner Prize
The 2007 Competition was held on Sunday, 21 October in New York City.
The participants in the contest were:
Rollo Carpenter from Icogno, creator of Jabberwacky
Noah Duncan, private entry, creator of Cletus
Robert Medeksza from Zabaware, creator of Ultra Hal Assistant
No bot passed the Turing test, but the judges ranked the bots according to which seemed "most human".
The results of the contest were:
1st place: Robert Medeksza
2nd place: Noah Duncan
3rd place: Rollo Carpenter
The winner received $2250 and the Annual Medal.
The runners up received $250 each.
2006 Loebner Prize
On Wednesday, August 30, the finalists for the 2006 Loebner Prize were announced.
The finalists were:
Rollo Carpenter
Richard Churchill and Marie-Claire Jenkins
Noah Duncan
Robert Medeksza
The contest was held on Sunday, 17 September at the Torrington Theatre, University College London.
Winners
Machine learning
As a broad subfield of artificial intelligence, machine learning is concerned with the design and development of algorithms and techniques that allow computers to "learn".
At a general level, there are two types of learning: inductive, and deductive.
Inductive machine learning methods extract rules and patterns out of massive data sets.
The major focus of machine learning research is to extract information from data automatically, by computational and statistical methods.
Hence, machine learning is closely related not only to data mining and statistics, but also to theoretical computer science.
Applications
Machine learning has a wide spectrum of applications including natural language processing, syntactic pattern recognition, search engines, medical diagnosis, bioinformatics, brain-machine interfaces and cheminformatics, detecting credit card fraud, stock market analysis, classifying DNA sequences, speech and handwriting recognition, object recognition in computer vision, game playing and robot locomotion.
Human interaction
Some machine learning systems attempt to eliminate the need for human intuition in the analysis of the data, while others adopt a collaborative approach between human and machine.
Human intuition cannot be entirely eliminated since the designer of the system must specify how the data is to be represented and what mechanisms will be used to search for a characterization of the data.
Machine learning can be viewed as an attempt to automate parts of the scientific method.
Some statistical machine learning researchers create methods within the framework of Bayesian statistics.
Algorithm types
Machine learning algorithms are organized into a taxonomy, based on the desired outcome of the algorithm.
Common algorithm types include:
Supervised learning — in which the algorithm generates a function that maps inputs to desired outputs.
One standard formulation of the supervised learning task is the classification problem: the learner is required to learn (to approximate) the behavior of a function which maps a vector into one of several classes by looking at several input-output examples of the function.
Unsupervised learning — in which the algorithm models a set of inputs for which labeled examples are not available.
Semi-supervised learning — which combines both labeled and unlabeled examples to generate an appropriate function or classifier.
Reinforcement learning — in which the algorithm learns a policy of how to act given an observation of the world.
Every action has some impact in the environment, and the environment provides feedback that guides the learning algorithm.
Transduction — similar to supervised learning, but does not explicitly construct a function: instead, tries to predict new outputs based on training inputs, training outputs, and test inputs which are available while training.
Learning to learn — in which the algorithm learns its own inductive bias based on previous experience.
The computational analysis of machine learning algorithms and their performance is a branch of theoretical computer science known as computational learning theory.
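The classification problem described under supervised learning can be made concrete with a small sketch. The example below uses a k-nearest-neighbour rule, chosen here only because it is short; the training points, labels and function name are hypothetical and invented for illustration.

```python
import math
from collections import Counter

# Hypothetical input-output examples: (feature vector, class label).
TRAINING = [
    ((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
    ((5.0, 5.0), "b"), ((5.2, 4.9), "b"), ((4.8, 5.1), "b"),
]


def knn_classify(examples, query, k=3):
    """Predict the class of `query` by majority vote of its k nearest examples."""
    # Sort the labeled examples by Euclidean distance to the query vector.
    nearest = sorted(examples, key=lambda ex: math.dist(ex[0], query))[:k]
    # Count the labels among the k closest examples and return the most common.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    print(knn_classify(TRAINING, (1.1, 1.0)))  # a
    print(knn_classify(TRAINING, (5.0, 4.8)))  # b
```

The learner never sees the underlying function itself; it approximates it purely from the labeled examples, which is what distinguishes this setting from unsupervised learning.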
Machine learning topics
This list represents the topics covered on a typical machine learning course.
Prerequisites
Bayesian theory
Modeling conditional probability density functions: regression and classification
Artificial neural networks
Decision trees
Gene expression programming
Genetic algorithms
Genetic programming
Holographic associative memory
Inductive Logic Programming
Gaussian process regression
Linear discriminant analysis
K-nearest neighbor
Minimum message length
Perceptron
Quadratic classifier
Radial basis function networks
Support vector machines
Algorithms for estimating model parameters:
Dynamic programming
Expectation-maximization algorithm
Modeling probability density functions through generative models:
Graphical models including Bayesian networks and Markov random fields
Generative topographic map
Approximate inference techniques
Monte Carlo methods
Variational Bayes
Variable-order Markov models
Variable-order Bayesian networks
Loopy belief propagation
Optimization
Most of the methods listed above either use optimization or are instances of optimization algorithms.
Meta-learning (ensemble methods)
Boosting
Bootstrap aggregating
Random forest
Weighted majority algorithm
Inductive transfer and learning to learn
Inductive transfer
Reinforcement learning
Temporal difference learning
Monte-Carlo method
Machine translation
Machine translation, sometimes referred to by the abbreviation MT, is a sub-field of computational linguistics that investigates the use of computer software to translate text or speech from one natural language to another.
At its basic level, MT performs simple substitution of words in one natural language for words in another.
Using corpus techniques, more complex translations may be attempted, allowing for better handling of differences in linguistic typology, phrase recognition, and translation of idioms, as well as the isolation of anomalies.
Current machine translation software often allows for customisation by domain or profession (such as weather reports) — improving output by limiting the scope of allowable substitutions.
This technique is particularly effective in domains where formal or formulaic language is used.
It follows then that machine translation of government and legal documents more readily produces usable output than conversation or less standardised text.
Improved output quality can also be achieved by human intervention: for example, some systems are able to translate more accurately if the user has unambiguously identified which words in the text are names.
With the assistance of these techniques, MT has proven useful as a tool to assist human translators, and in some cases can even produce output that can be used "as is".
However, current systems are unable to produce output of the same quality as a human translator, particularly where the text to be translated uses casual language.
History
The history of machine translation begins in the 1950s, after World War II.
The Georgetown experiment (1954) involved fully-automatic translation of over sixty Russian sentences into English.
The experiment was a great success and ushered in an era of substantial funding for machine-translation research.
The authors claimed that within three to five years, machine translation would be a solved problem.
Real progress was much slower, however, and after the ALPAC report (1966), which found that the ten-year-long research had failed to fulfill expectations, funding was greatly reduced.
Beginning in the late 1980s, as computational power increased and became less expensive, more interest was shown in statistical models for machine translation.
The idea of using digital computers for translation of natural languages was proposed as early as 1946 by A. D. Booth and possibly others.
The Georgetown experiment was by no means the first such application, and a demonstration was made in 1954 on the APEXC machine at Birkbeck College (London Univ.) of a rudimentary translation of English into French.
Several papers on the topic were published at the time, and even articles in popular journals (see for example Wireless World, Sept. 1955, Cleave and Zacharov).
A similar application, also pioneered at Birkbeck College at the time, was reading and composing Braille texts by computer.
Recently, the Internet has emerged as a global information infrastructure, revolutionizing access to information as well as fast information transfer and exchange.
Using the Internet and e-mail, people need to communicate rapidly over long distances and across continental boundaries.
Not all Internet users, however, can use their own language to communicate globally with people who speak other languages.
With machine translation software, people may in the near future be able to communicate with one another around the world in their own mother tongue.
Translation process
The translation process may be stated as:
Decoding the meaning of the source text; and
Re-encoding this meaning in the target language.
Behind this ostensibly simple procedure lies a complex cognitive operation.
To decode the meaning of the source text in its entirety, the translator must interpret and analyse all the features of the text, a process that requires in-depth knowledge of the grammar, semantics, syntax, idioms, etc., of the source language, as well as the culture of its speakers.
The translator needs the same in-depth knowledge to re-encode the meaning in the target language.
Therein lies the challenge in machine translation: how to program a computer that will "understand" a text as a person does, and that will "create" a new text in the target language that "sounds" as if it has been written by a person.
This problem may be approached in a number of ways.
Approaches
Machine translation can use a method based on linguistic rules, which means that words will be translated in a linguistic way — the most suitable (orally speaking) words of the target language will replace the ones in the source language.
It is often argued that the success of machine translation requires the problem of natural language understanding to be solved first.
Generally, rule-based methods parse a text, usually creating an intermediary, symbolic representation, from which the text in the target language is generated.
According to the nature of the intermediary representation, an approach is described as interlingual machine translation or transfer-based machine translation.
These methods require extensive lexicons with morphological, syntactic, and semantic information, and large sets of rules.
Given enough data, machine translation programs often work well enough for a native speaker of one language to get the approximate meaning of what is written by the other native speaker.
The difficulty is getting enough data of the right kind to support the particular method.
For example, the large multilingual corpus of data needed for statistical methods to work is not necessary for the grammar-based methods.
But then, the grammar methods need a skilled linguist to carefully design the grammar that they use.
To translate between closely related languages, a technique referred to as shallow-transfer machine translation may be used.
Rule-based
The rule-based machine translation paradigm includes transfer-based machine translation, interlingual machine translation and dictionary-based machine translation paradigms.
Transfer-based machine translation
Interlingual
Interlingual machine translation is one instance of rule-based machine-translation approaches.
In this approach, the source language, i.e. the text to be translated, is transformed into an interlingual, i.e. source-/target-language-independent representation.
The target language is then generated out of the interlingua.
Dictionary-based
Machine translation can use a method based on dictionary entries, which means that the words will be translated as they are by a dictionary.
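The word-for-word substitution described above can be sketched in a few lines of Python. This is a minimal illustration, not a real translation system: the four-entry English-to-French dictionary and the function name translate_word_for_word are assumptions made up for the example.

```python
# Hypothetical toy dictionary; a real system would use a far larger lexicon.
TOY_DICTIONARY = {
    "the": "le",
    "cat": "chat",
    "eats": "mange",
    "fish": "poisson",
}


def translate_word_for_word(sentence, dictionary):
    """Replace each word by its dictionary entry, leaving unknown words as-is."""
    translated = []
    for word in sentence.lower().split():
        translated.append(dictionary.get(word, word))
    return " ".join(translated)


print(translate_word_for_word("The cat eats fish", TOY_DICTIONARY))
```

The output keeps the source word order and ignores grammar entirely, which is why pure dictionary lookup tends to be usable only for very restricted text.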
Statistical
Statistical machine translation tries to generate translations using statistical methods based on bilingual text corpora, such as the Canadian Hansard corpus (the English-French record of the Canadian parliament) and EUROPARL (the record of the European Parliament).
Where such corpora are available, impressive results can be achieved translating texts of a similar kind, but such corpora are still very rare.
The first statistical machine translation software was CANDIDE from IBM.
Google used SYSTRAN for several years, but switched to a statistical translation method in October 2007.
Recently, they improved their translation capabilities by inputting approximately 200 billion words from United Nations materials to train their system.
Accuracy of the translation has improved.
Example-based
The example-based machine translation (EBMT) approach is often characterised by its use of a bilingual corpus as its main knowledge base, at run-time.
It is essentially a translation by analogy and can be viewed as an implementation of the case-based reasoning approach of machine learning.
Major issues
Disambiguation
Word sense disambiguation concerns finding a suitable translation when a word can have more than one meaning.
The problem was first raised in the 1950s by Yehoshua Bar-Hillel.
He pointed out that without a "universal encyclopedia", a machine would never be able to distinguish between the two meanings of a word.
Today there are numerous approaches designed to overcome this problem.
They can be approximately divided into "shallow" approaches and "deep" approaches.
Shallow approaches assume no knowledge of the text.
They simply apply statistical methods to the words surrounding the ambiguous word.
Deep approaches presume a comprehensive knowledge of the word.
So far, shallow approaches have been more successful.
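A shallow approach of the kind described above can be sketched as a simple overlap count between the words surrounding the ambiguous word and a set of context words for each sense. The sketch below is only an illustration: the senses of "bank", the hand-written context sets, and the function name pick_sense are all assumptions, and a practical system would estimate such statistics from a corpus.

```python
# Hypothetical context words for two senses of "bank"; in practice these
# would be estimated from a corpus rather than written by hand.
SENSE_CONTEXTS = {
    "financial institution": {"money", "loan", "account", "deposit"},
    "river edge": {"river", "water", "fishing", "shore"},
}


def pick_sense(sentence, ambiguous_word, sense_contexts):
    """Choose the sense whose context words overlap most with the sentence."""
    neighbours = {w for w in sentence.lower().split() if w != ambiguous_word}
    scores = {
        sense: len(neighbours & context)
        for sense, context in sense_contexts.items()
    }
    # On a tie, the sense listed first wins.
    return max(scores, key=scores.get)


print(pick_sense("she opened an account at the bank", "bank", SENSE_CONTEXTS))
print(pick_sense("we went fishing on the bank of the river", "bank", SENSE_CONTEXTS))
```

Note that the function uses no knowledge of what the words mean; it only counts neighbouring words, which is what makes the approach "shallow".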
Named entities
This issue is related to named entity recognition in information extraction.
Applications
There are now many software programs for translating natural language, several of them online, such as the SYSTRAN system which powers both Google translate and AltaVista's Babel Fish as well as Promt that powers online translation services at Voila.fr and Orange.fr.
Although no system provides the holy grail of "fully automatic high quality machine translation" (FAHQMT), many systems produce reasonable output.
Despite their inherent limitations, MT programs are used around the world.
Probably the largest institutional user is the European Commission.
Toggletext uses a transfer-based system (known as Kataku) to translate between English and Indonesian.
Google has claimed that promising results were obtained using a proprietary statistical machine translation engine.
The statistical translation engine used in the Google language tools for Arabic <-> English and Chinese <-> English achieved an overall BLEU-4 score of 0.4281, ahead of runner-up IBM's score of 0.3954 (Summer 2006), in tests conducted by the National Institute of Standards and Technology.
Uwe Muegge has implemented a demo website that uses a controlled language in combination with the Google tool to produce fully automatic, high-quality machine translations of his English, German, and French web sites.
With the recent focus on terrorism, military sources in the United States have been investing significant amounts of money in natural language engineering.
In-Q-Tel (a venture capital fund, largely funded by the US Intelligence Community, to stimulate new technologies through private sector entrepreneurs) brought up companies like Language Weaver.
Currently the military community is interested in translation and processing of languages like Arabic, Pashto, and Dari.
The Information Processing Technology Office in DARPA hosts programs like TIDES and Babylon Translator.
The US Air Force has awarded a $1 million contract to develop a language translation technology.
Evaluation
There are various means for evaluating the performance of machine-translation systems.
The oldest is the use of human judges to assess a translation's quality.
Even though human evaluation is time-consuming, it is still the most reliable way to compare different systems such as rule-based and statistical systems.
Automated means of evaluation include BLEU, NIST and METEOR.
Relying exclusively on machine translation ignores that communication in human language is context-embedded, and that it takes a human to adequately comprehend the context of the original text.
Even purely human-generated translations are prone to error.
Therefore, to ensure that a machine-generated translation will be of publishable quality and useful to a human, it must be reviewed and edited by a human.
It has, however, been asserted that in certain applications, e.g. product descriptions written in a controlled language, a dictionary-based machine-translation system has produced satisfactory translations that require no human intervention.
Metadata
Metadata (meta data, or sometimes metainformation) is "data about data", of any sort in any media.
An item of metadata may describe an individual datum, or content item, or a collection of data including multiple content items and hierarchical levels, for example a database schema.
Purpose
Metadata provides context for data.
Metadata is used to facilitate the understanding, characterization, and management of data.
The metadata required for effective data management varies with the type of data and context of use.
In a library, where the data is the content of the titles stocked, metadata about a title would typically include a description of the content, the author, the publication date and the physical location.
Examples of Metadata
Camera
In the context of a camera, where the data is the photographic image, metadata would typically include the date the photograph was taken and details of the camera settings (lens, focal length, aperture, shutter timing, white balance, etc.).
Digital Music Player
On a digital portable music player, the album names, song titles and album art embedded in the music files are used to generate the artist and song listings, and are considered the metadata.
Information system
In the context of an information system, where the data is the content of the computer files, metadata about an individual data item would typically include the name of the field and its length.
Metadata about a collection of data items, a computer file, might typically include the name of the file, the type of file and the name of the data administrator.
Real world location
If we consider a particular place in the real world, this may be described by data, for example:
1 "E83BJ"
2 "17"
3 "Sunny"
To make sense of and use this data, context is important, and can be provided by metadata.
The metadata for the above three items of data might include:
1.1 "Post Code" – This is a brief description (or name) of the data item "E83BJ"
1.2 "The unique identifier of a postal district" – This is another description (a definition) of "E83BJ"
1.3 "27 June 2006" – This could also help describe "E83BJ", for example by giving the date it was last updated
2 "Average temperature in degrees Celsius" – This is a possible description of "17"
3 "Yesterday's weather" – This is a description of "sunny"
An item of metadata is itself data and therefore may have its own metadata.
For example, "Post Code" might have the following metadata:
1.1.1 "data item name"
1.1.2 "5 characters, starting with A – Z"
"27 June 2006" might have the following metadata:
1.3.1 "date last changed"
1.3.2 "dd MMM yyyy"
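The idea that metadata is itself data with its own metadata can be sketched as a recursive structure. The Python below is one possible representation, assuming a dictionary with "value" and "metadata" keys; that layout and the helper name describe are illustrative choices, not a standard format.

```python
# Each datum carries a list of metadata items; each metadata item is itself
# a datum and may carry its own metadata. The structure is an illustrative
# assumption, not a standard format.
post_code = {
    "value": "E83BJ",
    "metadata": [
        {
            "value": "Post Code",
            "metadata": [
                {"value": "data item name", "metadata": []},
                {"value": "5 characters, starting with A – Z", "metadata": []},
            ],
        },
        {"value": "The unique identifier of a postal district", "metadata": []},
        {
            "value": "27 June 2006",
            "metadata": [
                {"value": "date last changed", "metadata": []},
                {"value": "dd MMM yyyy", "metadata": []},
            ],
        },
    ],
}


def describe(item, depth=0):
    """Return one indented line per item, walking metadata recursively."""
    lines = ["  " * depth + item["value"]]
    for meta in item["metadata"]:
        lines.extend(describe(meta, depth + 1))
    return lines


print("\n".join(describe(post_code)))
```

Each level of indentation in the printed result corresponds to one further level of "data about data".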
Levels
The hierarchy of metadata descriptions can go on forever, but usually context or semantic understanding makes extensively detailed explanations unnecessary.
The role played by any particular datum depends on the context.
For example, when considering the geography of London, "E83BJ" would be a datum and "Post Code" would be a metadatum.
But, when considering the data management of an automated system that manages geographical data, "Post Code" might be a datum and then "data item name" and "5 characters, starting with A – Z" would be metadata.
In any particular context, metadata characterizes the data it describes, not the entity described by that data.
So, in relation to "E83BJ", the datum "is in London" is a further description of the place in the real world which has the post code "E83BJ", not of the code itself.
Therefore, although it is providing information connected to "E83BJ" (telling us that this is the post code of a place in London), this would not normally be considered metadata, as it is describing "E83BJ" qua place in the real world and not qua data.
Definitions
Etymology
Meta is a classical Greek preposition (μετ’ αλλων εταιρων) and prefix (μεταβασις) conveying the following senses in English, depending upon the case of the associated noun: among; along with; with; by means of; in the midst of; after; behind.
In epistemology, the word means "about (its own category)"; thus metadata is "data about the data".
Varying definitions
The term was introduced intuitively, without a formal definition.
Because of that, today there are various definitions.
The most common one is the literal translation:
"Data about data are referred to as metadata."
Example: "12345" is data, and with no additional context is meaningless.
When "12345" is given a meaningful name (metadata) of "ZIP code", one can understand (at least in the United States, and further placing "ZIP code" within the context of a postal address) that "12345" refers to the General Electric plant in Schenectady, New York.
As for most people the difference between data and information is merely a philosophical one of no relevance in practical use, other definitions are:
Metadata is information about data.
Metadata is information about information.
Metadata contains information about that data or other data.
There are more sophisticated definitions, such as:
"Metadata is structured, encoded data that describe characteristics of information-bearing entities to aid in the identification, discovery, assessment, and management of the described entities."
"[Metadata is a set of] optional structured descriptions that are publicly available to explicitly assist in locating objects."
These are used more rarely because they tend to concentrate on one purpose of metadata — to find "objects", "entities" or "resources" — and ignore others, such as using metadata to optimize compression algorithms, or to perform additional computations using the data.
The metadata concept has been extended into the world of systems to include any "data about data": the names of tables, columns, programs, and the like.
Different views of this "system metadata" are detailed below, but beyond that is the recognition that metadata can describe all aspects of systems: data, activities, people and organizations involved, locations of data and processes, access methods, limitations, timing and events, as well as motivation and rules.
Fundamentally, then, metadata is "the data that describe the structure and workings of an organization's use of information, and which describe the systems it uses to manage that information".
To do a model of metadata is to do an "Enterprise model" of the information technology industry itself.
Metadata and Markup
In the context of the web and the work of the W3C in providing markup technologies of HTML, XML and SGML the concept of metadata has specific context that is perhaps clearer than in other information domains.
With markup technologies there is metadata, markup and data content.
The metadata describes characteristics about the data, while the markup identifies the specific type of data content and acts as a container for that document instance.
This page in Wikipedia is itself an example of such usage, where the textual information is data, how it is packaged, linked, referenced, styled and displayed is markup and aspects and characteristics of that markup are metadata set globally across Wikipedia.
In the context of markup the metadata is architected to allow optimization of document instances to contain only a minimum amount of metadata, while the metadata itself is likely referenced externally such as in a schema definition (XSD) instance.
Also it should be noted that markup provides specialised mechanisms that handle referential data, again avoiding confusion over what is metadata or data, and allowing optimizations.
The reference and ID mechanisms in markup allow reference links between related data items, and links to data items that can then be repeated about a data item, such as an address or product details.
These are then all themselves simply more data items and markup instances rather than metadata.
Similarly there are concepts such as classifications, ontologies and associations for which markup mechanisms are provided.
A data item can then be linked to such categories via markup, providing a clean delineation between what is metadata and what are actual data instances.
Therefore the concepts and descriptions in a classification would be metadata, but the actual classification entry for a data item is simply another data instance.
Some examples can illustrate the points here.
Items in bold are data content, in italic are metadata, normal text items are all markup.
The two examples show in-line use of metadata within markup relating to a data instance (XML) compared to simple markup (HTML).
A simple HTML instance example:
<span style="normalText">Example</span>
And then an XML instance example with metadata:
John
Where the inline assertion that a person's middle name may be an empty data item is metadata about the data item.
Such definitions however are usually not placed inline in XML.
Instead these definitions are moved away into the schema definition that contains the metadata for the entire document instance.
This again illustrates another important aspect of metadata in the context of markup.
The metadata is optimally defined only once for a collection of data instances.
Hence repeated items of markup are rarely metadata, but rather more markup data instances themselves.
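To make the distinction between markup, inline metadata, and data content concrete, the following Python sketch takes a single XML element apart using the standard library. The element name personMiddleName and the nillable attribute are assumptions chosen to match the description above (an inline assertion that a middle name may be empty); they are not taken from any particular schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance: the element and attribute names are assumptions.
xml_instance = '<personMiddleName nillable="true">John</personMiddleName>'

element = ET.fromstring(xml_instance)

markup = element.tag             # the container identifying the type of content
metadata = dict(element.attrib)  # inline assertions about the data item
data = element.text              # the data content itself

print("markup:  ", markup)
print("metadata:", metadata)
print("data:    ", data)
```

In practice, as noted above, an assertion such as nillable would usually live in the schema definition rather than being repeated on every element.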
Hierarchies of metadata
When structured into a hierarchical arrangement, metadata is more properly called an ontology or schema.
Both terms describe "what exists" for some purpose or to enable some action.
For instance, the arrangement of subject headings in a library catalog serves not only as a guide to finding books on a particular subject in the stacks, but also as a guide to what subjects "exist" in the library's own ontology and how more specialized topics are related to or derived from the more general subject headings.
Metadata is frequently stored in a central location and used to help organizations standardize their data.
This information is typically stored in a metadata registry.
Difference between data and metadata
Usually it is not possible to distinguish between (plain) data and metadata because:
Something can be data and metadata at the same time.
The headline of an article is both its title (metadata) and part of its text (data).
Data and metadata can change their roles.
A poem, as such, would be regarded as data, but if there were a song that used it as lyrics, the whole poem could be attached to an audio file of the song as metadata.
Thus, the labeling depends on the point of view.
These considerations apply no matter which of the above definitions is considered, except where explicit markup is used to denote what is data and what is metadata.
Use
Metadata has many different applications; this section lists some of the most common.
Metadata is used to speed up and enrich searching for resources.
In general, search queries using metadata can save users from performing more complex filter operations manually.
It is now common for web browsers (with the notable exception of Mozilla Firefox), P2P applications and media management software to automatically download and locally cache metadata, to improve the speed at which files can be accessed and searched.
Metadata may also be associated with files manually.
This is often the case with documents which are scanned into a document storage repository such as FileNet or Documentum.
Once the documents have been converted into an electronic format a user brings the image up in a viewer application, manually reads the document and keys values into an online application to be stored in a metadata repository.
Metadata provides additional information to users of the data it describes.
This information may be descriptive ("These pictures were taken by children in the school's third grade class.") or algorithmic ("Checksum=139F").
Metadata helps to bridge the semantic gap.
By telling a computer how data items are related and how these relations can be evaluated automatically, it becomes possible to process even more complex filter and search operations.
For example, if a search engine understands that "Van Gogh" was a "Dutch painter", it can answer a search query on "Dutch painters" with a link to a web page about Vincent Van Gogh, although the exact words "Dutch painters" never occur on that page.
This approach, called knowledge representation, is of special interest to the semantic web and artificial intelligence.
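The search example above can be sketched as a lookup over metadata rather than over page text. In the Python below, the page names, the "is_a" key, and the function search_by_category are all hypothetical; the point is only that a page can match a query whose words never appear in its content.

```python
# Hypothetical metadata: each page is tagged with statements about its subject.
PAGE_METADATA = {
    "vincent_van_gogh.html": {"is_a": {"Dutch painter", "painter"}},
    "claude_monet.html": {"is_a": {"French painter", "painter"}},
}


def search_by_category(category, page_metadata):
    """Return pages whose metadata says the subject belongs to the category."""
    return sorted(
        page
        for page, meta in page_metadata.items()
        if category in meta["is_a"]
    )


print(search_by_category("Dutch painter", PAGE_METADATA))
print(search_by_category("painter", PAGE_METADATA))
```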
Certain metadata is designed to optimize lossy compression.
For example, if a video has metadata that allows a computer to tell foreground from background, the latter can be compressed more aggressively to achieve a higher compression rate.
Some metadata is intended to enable variable content presentation.
For example, if a picture has metadata that indicates the most important region — the one where there is a person — an image viewer on a small screen, such as on a mobile phone's, can narrow the picture to that region and thus show the user the most interesting details.
A similar kind of metadata is intended to allow blind people to access diagrams and pictures, by converting them for special output devices or reading their description using text-to-speech software.
Other descriptive metadata can be used to automate workflows.
For example, if a "smart" software tool knows content and structure of data, it can convert it automatically and pass it to another "smart" tool as input.
As a result, users save the many copy-and-paste operations required when analyzing data with "dumb" tools.
Metadata is becoming an increasingly important part of electronic discovery.
Application and file system metadata derived from electronic documents and files can be important evidence (see www.lexbe.com/hp/indepth-e-discovery-rule-metadata.htm).
Recent changes to the Federal Rules of Civil Procedure make metadata routinely discoverable as part of civil litigation.
Parties to litigation are required to maintain and produce metadata as part of discovery, and spoliation of metadata can lead to sanctions.
Metadata has become important on the World Wide Web because of the need to find useful information from the mass of information available.
Manually-created metadata adds value because it ensures consistency.
If a web page about a certain topic contains a word or phrase, then all web pages about that topic should contain that same word or phrase.
Metadata also ensures variety, so that if a topic goes by two names each will be used.
For example, an article about "sport utility vehicles" would also be tagged "4 wheel drives", "4WDs" and "four wheel drives", as this is how SUVs are known in some countries.
Examples of metadata for an audio CD include the MusicBrainz project and All Media Guide's Allmusic.
Similarly, MP3 files have metadata tags in a format called ID3.
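As an illustration of embedded audio metadata, the sketch below reads the oldest and simplest variant of the format, ID3v1, which stores a fixed 128-byte block at the end of the file beginning with the marker "TAG". Later ID3 versions use a different layout and are not handled here. The function name parse_id3v1 and the sample bytes are assumptions made for the example.

```python
def parse_id3v1(raw_bytes):
    """Return ID3v1 fields from the last 128 bytes, or None if no tag is present."""
    if len(raw_bytes) < 128:
        return None
    tag = raw_bytes[-128:]
    if tag[:3] != b"TAG":
        return None

    def text(field):
        # Fields are fixed-width and padded with NUL bytes or spaces.
        return field.split(b"\x00", 1)[0].decode("latin-1").strip()

    return {
        "title": text(tag[3:33]),
        "artist": text(tag[33:63]),
        "album": text(tag[63:93]),
        "year": text(tag[93:97]),
    }


# Build a fake file in memory: some placeholder "audio" bytes plus a tag.
sample = (
    b"\xff\xfb" + b"\x00" * 100          # placeholder audio data
    + b"TAG"
    + b"Song".ljust(30, b"\x00")         # title
    + b"Artist".ljust(30, b"\x00")       # artist
    + b"Album".ljust(30, b"\x00")        # album
    + b"1999"                            # year
    + b"".ljust(30, b"\x00")             # comment
    + bytes([255])                       # genre byte
)

print(parse_id3v1(sample))
```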
Types of metadata
Metadata can be classified by:
Content.
Metadata can either describe the resource itself (for example, name and size of a file) or the content of the resource (for example, "This video shows a boy playing football").
Mutability.
With respect to the whole resource, metadata can be either immutable (for example, the "Title" of a video does not change as the video itself is being played) or mutable (the "Scene description" does change).
Logical function.
There are three layers of logical function: at the bottom the subsymbolic layer that contains the raw data itself, then the symbolic layer with metadata describing the raw data, and on the top the logical layer containing metadata that allows logical reasoning using the symbolic layer.
Important issues
To successfully develop and use metadata, several important issues should be treated with care:
Metadata risks
Microsoft Office files include metadata beyond their printable content, such as the original author's name, the creation date of the document, and the amount of time spent editing it.
Unintentional disclosure can be awkward or even, in professional practices requiring confidentiality, raise malpractice concerns.
Some of a Microsoft Office document's metadata can be seen by clicking File then Properties from the program's menu.
Other metadata is not visible except through external analysis of a file, such as is done in forensics.
The author of the Microsoft Word-based Melissa computer virus in 1999 was caught due to Word metadata that uniquely identified the computer used to create the original infected document.
Metadata lifecycle
Even in the early phases of planning and designing it is necessary to keep track of all metadata created.
It is not economical to start attaching metadata only after the production process has been completed.
For example, if metadata created by a digital camera at recording time is not stored immediately, it may have to be restored afterwards manually with great effort.
Therefore, it is necessary for different groups of resource producers to cooperate using compatible methods and standards.
Manipulation.
Metadata must adapt if the resource it describes changes.
It should be merged when two resources are merged.
These operations are seldom performed by today's software; for example, image editing programs usually do not keep track of the Exif metadata created by digital cameras.
Destruction.
It can be useful to keep metadata even after the resource it describes has been destroyed, for example in change histories within a text document or to archive file deletions due to digital rights management.
None of today's metadata standards consider this phase.
Storage
Metadata can be stored either internally, in the same file as the data, or externally, in a separate file.
Metadata that is embedded with content is called embedded metadata.
A data repository typically stores the metadata detached from the data.
Both ways have advantages and disadvantages:
Internal storage allows transferring metadata together with the data it describes; thus, metadata is always at hand and can be manipulated easily.
This method creates high redundancy and does not allow holding metadata together.
External storage allows bundling metadata, for example in a database, for more efficient searching.
There is no redundancy and metadata can be transferred simultaneously when using streaming.
However, as most formats use URIs for that purpose, the method of how the metadata is linked to its data should be treated with care.
What if a resource does not have a URI (resources on a local hard disk or web pages that are created on-the-fly using a content management system)?
What if metadata can only be evaluated if there is a connection to the Web, especially when using RDF?
How can one detect that a resource has been replaced by another with the same name but different content?
Moreover, there is the question of data format: storing metadata in a human-readable format such as XML can be useful because users can understand and edit it without specialized tools.
On the other hand, these formats are not optimized for storage capacity; it may be useful to store metadata in a binary, non-human-readable format instead to speed up transfer and save memory.
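The trade-off between a human-readable and a binary representation can be seen by encoding the same small metadata record both ways. The Python sketch below is only an illustration: the three fields (width, height, ISO value), the XML element names, and the binary layout of three little-endian 16-bit integers are all assumptions.

```python
import struct

# Hypothetical metadata record for an image.
width, height, iso = 4000, 3000, 200

# Human-readable form: can be read and edited in any text editor.
as_xml = (
    f"<image><width>{width}</width><height>{height}</height>"
    f"<iso>{iso}</iso></image>"
).encode("utf-8")

# Binary form: three unsigned 16-bit integers, little-endian (assumed layout).
as_binary = struct.pack("<HHH", width, height, iso)

print("XML bytes:   ", len(as_xml))
print("binary bytes:", len(as_binary))

# The binary form is compact but meaningless without knowing the layout.
print(struct.unpack("<HHH", as_binary))
```

The binary record is much smaller, but a reader needs the layout (itself a piece of metadata) to interpret it, whereas the XML form carries its field names with it.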
Criticisms
Although the majority of computer scientists see metadata as a chance for better interoperability, some critics argue:
Metadata is too expensive and time-consuming.
The argument is that companies will not produce metadata without need because it costs extra money, and private users also will not produce complex metadata because its creation is very time-consuming.
Metadata is too complicated.
Private users will not create metadata because existing formats, especially MPEG-7, are too complicated.
As long as there are no automatic tools for creating metadata, it will not be created.
Metadata is subjective and depends on context.
Most probably, two persons will attach different metadata to the same resource due to their different points of view.
Moreover, metadata can be misinterpreted due to its dependency on context.
For example, searching for "post-modern art" may miss a certain item because the expression was not in use at the time when that work of art was created, or searching for "pictures taken at 1:00" may produce confusing results due to local time differences.
There is no end to metadata.
For example, when annotating a soccer match with metadata, one can describe all the players and their actions in time and stop there.
One can also describe the advertisements in the background and the clothes the players wear.
One can also describe each fan in the stands and the clothes they wear.
All of this metadata can be interesting to one party or another — such as the spectators, sponsors or a counter-terrorist unit of the police — and even for a simple resource the amount of possible metadata can be gigantic.
Metadata is useless.
Many of today's search engines are very efficient at finding text.
Other techniques for finding pictures, videos and music (namely query-by-example) will become more and more powerful in the future.
Thus, there is no real need for metadata.
Opponents of metadata sometimes use the term metacrap to refer to the unsolved problems of metadata in some scenarios.
These people are also referred to as "Meta Haters."
Types
In general, there are two distinct classes of metadata: structural or control metadata and guide metadata.
Structural metadata is used to describe the structure of computer systems such as tables, columns and indexes.
Guide metadata is used to help humans find specific items and is usually expressed as a set of keywords in a natural language.
Metadata can be divided into three distinct categories:
Descriptive
Administrative
Structural
Relational database metadata
Each relational database system has its own mechanisms for storing metadata.
Examples of relational-database metadata include:
Tables of all tables in a database, with their names, sizes and the number of rows in each table.
Tables of the columns in each database, with the tables they are used in and the type of data stored in each column.
In database terminology, this set of metadata is referred to as the catalog.
The SQL standard specifies a uniform means to access the catalog, called the information schema (INFORMATION_SCHEMA), but not all databases implement it, even if they implement other aspects of the SQL standard.
For an example of database-specific metadata access methods, see Oracle metadata.
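As a further illustration of database-specific catalog access, the following sketch uses SQLite through Python's sqlite3 module. SQLite exposes its catalog through the sqlite_master table and the table_info pragma rather than through the standard mechanism. The customer and invoice tables are hypothetical.

```python
import sqlite3

# In-memory database with two hypothetical tables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE invoice (id INTEGER PRIMARY KEY, total REAL)")

def list_tables(connection):
    """Return the names of all tables recorded in SQLite's catalog."""
    rows = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")
    return [row[0] for row in rows]

def list_columns(connection, table):
    """Return (column name, declared type) pairs for one table.

    PRAGMA table_info yields rows of (cid, name, type, notnull, dflt_value, pk).
    The table name is interpolated directly, so it must come from a trusted source.
    """
    rows = connection.execute(f"PRAGMA table_info({table})")
    return [(row[1], row[2]) for row in rows]

print(list_tables(conn))               # ['customer', 'invoice']
print(list_columns(conn, "customer"))  # [('id', 'INTEGER'), ('name', 'TEXT')]
```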
Data warehouse metadata
Data warehouse metadata systems are sometimes separated into two sections:
back room metadata that are used for Extract, transform, load functions to get OLTP data into a data warehouse
front room metadata that are used to label screens and create reports
Kimball lists the following types of metadata in a data warehouse (See also www.fortunecity.com/skyscraper/oracle/699/orahtml/dbmsmag/9803d05.html):
source system metadata
source specifications, such as repositories, and source logical schemas
source descriptive information, such as ownership descriptions, update frequencies, legal limitations, and access methods
process information, such as job schedules and extraction code
data staging metadata
data acquisition information, such as data transmission scheduling and results, and file usage
dimension table management, such as definitions of dimensions, and surrogate key assignments
transformation and aggregation, such as data enhancement and mapping, DBMS load scripts, and aggregate definitions
audit, job logs and documentation, such as data lineage records, data transform logs
DBMS metadata, such as:
DBMS system table contents
processing hints
Michael Bracket defines metadata (what he calls "Data resource data") as "any data about the organization's data resource".
Adrienne Tannenbaum defines metadata as "the detailed description of instance data.
The format and characteristics of populated instance data: instances and values, dependent on the role of the metadata recipient".
These definitions are characteristic of the "data about data" definition.
Business Intelligence metadata
Business Intelligence is the process of analyzing large amounts of corporate data, usually stored in large databases such as the Data Warehouse, tracking business performance, detecting patterns and trends, and helping enterprise business users make better decisions.
Business Intelligence metadata describes how data is queried, filtered, analyzed, and displayed in Business Intelligence software tools, such as Reporting tools, OLAP tools, Data Mining tools.
Examples:
OLAP metadata: The descriptions and structures of Dimensions, Cubes, Measures (Metrics), Hierarchies, Levels, Drill Paths
Reporting metadata: The descriptions and structures of Reports, Charts, Queries, DataSets, Filters, Variables, Expressions
Data Mining metadata: The descriptions and structures of DataSets, Algorithms, Queries
Business Intelligence metadata can be used to understand how corporate financial reports reported to Wall Street are calculated, how the revenue, expense and profit are aggregated from individual sales transactions stored in the data warehouse.
A good understanding of Business Intelligence metadata is required to solve complex problems such as compliance with corporate governance standards, such as Sarbanes Oxley (SOX) or Basel II.
General IT metadata
In contrast, David Marco, another metadata theorist, defines metadata as "all physical data and knowledge from inside and outside an organization, including information about the physical data, technical and business processes, rules and constraints of the data, and structures of the data used by a corporation."
Others have included web services, systems and interfaces.
In fact, the entire Zachman framework (see Enterprise Architecture) can be represented as metadata.
Notice that such definitions expand metadata's scope considerably, to encompass most or all of the data required by the Management Information Systems capability.
In this sense, the concept of metadata has significant overlaps with the ITIL concept of a Configuration Management Database (CMDB), and also with disciplines such as Enterprise Architecture and IT portfolio management.
This broader definition of metadata has precedent.
Third generation corporate repository products (such as those eventually merged into the CA Advantage line) not only store information about data definitions (COBOL copybooks, DBMS schema), but also about the programs accessing those data structures, and the Job Control Language and batch job infrastructure dependencies as well.
These products (some of which are still in production) can provide a very complete picture of a mainframe computing environment, supporting exactly the kinds of impact analysis required for ITIL-based processes such as Incident and Change Management.
The ITIL Back Catalogue includes the Data Management volume which recognizes the role of these metadata products on the mainframe, posing the CMDB as the distributed computing equivalent.
CMDB vendors however have generally not expanded their scope to include data definitions, and metadata solutions are also available in the distributed world.
Determining the appropriate role and scope for each is thus a challenge for large IT organizations requiring the services of both.
Since metadata is pervasive, centralized attempts at tracking it need to focus on the most highly leveraged assets.
Enterprise Assets may only constitute a small percentage of the entire IT portfolio.
Some practitioners have successfully managed IT metadata using the Dublin Core metamodel.
IT metadata management products
First generation data dictionary/metadata repository tools would be those only supporting a specific DBMS, such as IDMS's IDD (integrated data dictionary), the IMS Data Dictionary, and ADABAS's Predict.
Second generation would be ASG's DATAMANAGER product which could support many different file and DBMS types.
Third generation repository products became briefly popular in the early 1990s along with the rise of widespread use of RDBMS engines such as IBM's DB2.
Fourth generation products link the repository with more Extract, transform, load tools and can be connected with architectural modeling tools.
Examples include Adaptive Metadata Manager from Adaptive, Rochade from ASG, InfoLibrarian Metadata Integration Framework and Troux Technologies Metis Server product.
File system metadata
Nearly all file systems keep metadata about files out-of-band.
Some systems keep metadata in directory entries; others keep it in specialized structures such as inodes, or even in the name of a file.
Metadata can range from simple timestamps, mode bits, and other special-purpose information used by the implementation itself, to icons and free-text comments, to arbitrary attribute-value pairs.
With more complex and open-ended metadata, it becomes useful to search for files based on the metadata contents.
The Unix find utility was an early example, although inefficient when scanning hundreds of thousands of files on a modern computer system.
Apple Computer's Mac OS X operating system supports cataloguing and searching for file metadata through a feature known as Spotlight, as of version 10.4.
Microsoft developed similar functionality with the Instant Search system in Windows Vista; comparable search is also present in SharePoint Server.
Linux implements file metadata using extended file attributes.
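The simpler kinds of file metadata described above (size, mode bits and timestamps) can be read through the operating system's stat interface. The following is a minimal sketch using Python's standard library; the describe helper and the temporary sample file are illustrative only.

```python
import os
import stat
import tempfile
import time

def describe(path):
    """Collect a few pieces of out-of-band metadata the file system keeps."""
    info = os.stat(path)
    return {
        "size_bytes": info.st_size,
        "mode": stat.filemode(info.st_mode),  # e.g. '-rw-------'
        "modified": time.strftime("%Y-%m-%d %H:%M:%S",
                                  time.localtime(info.st_mtime)),
        "is_directory": stat.S_ISDIR(info.st_mode),
    }

# Create a small temporary file so the example is self-contained.
with tempfile.NamedTemporaryFile(delete=False) as handle:
    handle.write(b"hello")
    sample_path = handle.name

summary = describe(sample_path)
os.remove(sample_path)
print(summary)
```

None of this information is stored in the file's contents; it is all returned by the file system itself.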
Image metadata
Examples of image files containing metadata include Exchangeable image file format (EXIF) and Tagged Image File Format (TIFF).
Having metadata about images embedded in TIFF or EXIF files is one way of acquiring additional data about an image.
Tagging pictures with subjects, related emotions, and other descriptive phrases helps Internet users find pictures easily rather than having to search through entire image collections.
A prime example of an image tagging service is Flickr, where users upload images and then describe the contents.
Other patrons of the site can then search for those tags.
Flickr uses a folksonomy: a free-text keyword system in which the community defines the vocabulary through use rather than through a controlled vocabulary.
Users can also tag photos for organization purposes using Adobe's Extensible Metadata Platform (XMP) language, for example.
Digital photography is increasingly making use of technical metadata tags describing the conditions of exposure.
Photographers shooting Camera RAW file formats can use applications such as Adobe Bridge or Apple Computer's Aperture to work with camera metadata for post-processing.
Audio Metadata
Audio metadata generally relates to how the data should be written so that a processor can handle it efficiently.
Such metadata is commonly encountered in audio engine programming, for example in Microsoft's RIFF (Resource Interchange File Format), the container used for .wav files.
Codecs generally define their own metadata standards for compression purposes.
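As an illustration of the metadata a RIFF-based audio file carries, the sketch below writes a short silent WAV file into memory with Python's standard wave module and then reads back the header fields that tell a player how to interpret the sample data. The function names and parameter values are illustrative assumptions.

```python
import io
import wave

def make_wav(frames=100, rate=8000):
    """Build a tiny mono, 16-bit WAV file in memory (all samples silent)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as writer:
        writer.setnchannels(1)
        writer.setsampwidth(2)  # bytes per sample
        writer.setframerate(rate)
        writer.writeframes(b"\x00\x00" * frames)
    return buffer.getvalue()

def read_wav_metadata(data):
    """Return the format metadata stored in the WAV header."""
    with wave.open(io.BytesIO(data), "rb") as reader:
        return {
            "channels": reader.getnchannels(),
            "sample_width_bytes": reader.getsampwidth(),
            "frame_rate_hz": reader.getframerate(),
            "frames": reader.getnframes(),
        }

wav_bytes = make_wav()
header = read_wav_metadata(wav_bytes)
print(wav_bytes[:4], wav_bytes[8:12])  # b'RIFF' b'WAVE'
print(header)
```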
Program metadata
Metadata is casually used to describe the controlling data used in software architectures that are more abstract or configurable.
Most executable file formats include what may be termed "metadata" that specifies certain, usually configurable, behavioral runtime characteristics.
However, it is difficult if not impossible to precisely distinguish program "metadata" from general aspects of stored-program computing architecture; if the machine reads it and acts upon it, it is a computational instruction, and the prefix "meta" has little significance.
In Java, the class file format contains metadata used by the Java compiler and the Java virtual machine to dynamically link classes and to support reflection.
The J2SE 5.0 version of Java included a metadata facility to allow additional annotations that are used by development tools.
In MS-DOS, the COM file format does not include metadata, while the EXE file and Windows PE formats do.
These metadata can include the company that published the program, the date the program was created, the version number and more.
In the Microsoft .NET executable format, extra metadata is included to allow reflection at runtime.
Existing software metadata
The Object Management Group (OMG) has defined a metadata format for representing entire existing applications for the purposes of software mining, software modernization and software assurance.
This specification, called the OMG Knowledge Discovery Metamodel (KDM), is the OMG's foundation for "modeling in reverse".
KDM is a common language-independent intermediate representation that provides an integrated view of an entire enterprise application, including its behavior (program flow), data, and structure.
One of the applications of KDM is Business Rules Mining.
The Knowledge Discovery Metamodel includes a fine-grained, low-level representation (called "micro KDM") suitable for performing static analysis of programs.
Document metadata
Most programs that create documents, including Microsoft SharePoint, Microsoft Word and other Microsoft Office products, save metadata with the document files.
These metadata can contain the name of the person who created the file (obtained from the operating system), the name of the person who last edited the file, how many times the file has been printed, and even how many revisions have been made on the file.
Other saved material, such as deleted text (saved in case of an undelete command), document comments and the like, is also commonly referred to as "metadata", and the inadvertent inclusion of this material in distributed files has sometimes led to undesirable disclosures.
Document metadata is particularly important in legal environments, where litigation can request this sensitive information, which may include many private and potentially damaging details.
Such data has figured in multiple lawsuits that have caused legal complications for corporations.
Many law firms today use "Metadata Management Software", also known as "Metadata Removal Tools".
This software can be used to clean documents before they are sent outside the firm.
This process, known as metadata management, protects law firms from potentially damaging leaks of sensitive data through Electronic Discovery.
For a list of executable formats, see object file.
Metamodels
Metadata on Models are called Metamodels.
In Model Driven Engineering, a Model has to conform to a given Metamodel.
According to the MDA guide, a metamodel is a model and each model conforms to a given metamodel.
Meta-modeling allows strict and agile automatic processing of models and metamodels.
The Object Management Group (OMG) defines 4 layers of meta-modeling.
Each level of modeling is defined and validated by the next layer:
M0: instance object, data row, record -> "John Smith"
M1: model, schema -> "Customer" UML Class or database Table
M2: metamodel -> Unified Modeling Language (UML), Common Warehouse Metamodel (CWM), Knowledge Discovery Metamodel (KDM)
M3: meta-metamodel -> Meta-Object Facility (MOF)
Meta-metadata
Since metadata are also data, it is possible to have metadata about metadata: "meta-metadata."
Machine-generated meta-metadata, such as the inverted index created by a free-text search engine, is generally not considered metadata, though.
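The kind of index a free-text search engine builds, usually called an inverted index, maps each word to the documents that contain it. The following is a minimal sketch under simplifying assumptions (whitespace tokenization, case folding only); the sample documents are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each lower-cased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, word):
    """Return the sorted ids of documents containing the word."""
    return sorted(index.get(word.lower(), set()))

docs = {
    1: "metadata describes data",
    2: "data about data",
    3: "search engines index text",
}
index = build_inverted_index(docs)
print(search(index, "data"))      # [1, 2]
print(search(index, "Metadata"))  # [1]
```

The index is derived entirely by machine from the documents' contents, which is why it is not usually regarded as metadata in the descriptive sense.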
Digital library metadata
There are three categories of metadata that are frequently used to describe objects in a digital library:
descriptive - Information describing the intellectual content of the object, such as MARC cataloguing records, finding aids or similar schemes.
It is typically used for bibliographic purposes and for search and retrieval.
structural - Information that ties each object to others to make up logical units (e.g., information that relates individual images of pages from a book to the others that make up the book).
administrative - Information used to manage the object or control access to it.
This may include information on how it was scanned, its storage format, copyright and licensing information, and information necessary for the long-term preservation of the digital objects.
Geospatial metadata
Metadata that describe geographic objects (such as datasets, maps, features, or simply documents with a geospatial component) have a history going back to at least 1994 (refer to the MIT Library page on FGDC Metadata).
This class of metadata is described more fully on the Geospatial metadata page.
Microsoft Windows
Microsoft Windows is a series of software operating systems produced by Microsoft.
Microsoft first introduced an operating environment named Windows in November 1985 as an add-on to MS-DOS in response to the growing interest in graphical user interfaces (GUIs).
Microsoft Windows came to dominate the world's personal computer market, overtaking Mac OS, which had been introduced previously.
At the 2004 IDC Directions conference, it was stated that Windows had approximately 90% of the client operating system market.
The most recent client version of Windows is Windows Vista; the current server version is Windows Server 2008.
Versions
The term Windows collectively describes any or all of several generations of Microsoft (MS) operating system (OS) products.
These products are generally categorized as follows:
16-bit operating environments
The early versions of Windows were often thought of as just graphical user interfaces, mostly because they ran on top of MS-DOS and used it for file system services.
However, even the earliest 16-bit Windows versions already assumed many typical operating system functions, notably, having their own executable file format and providing their own device drivers (timer, graphics, printer, mouse, keyboard and sound) for applications.
Unlike MS-DOS, Windows allowed users to execute multiple graphical applications at the same time, through cooperative multitasking.
Finally, Windows implemented an elaborate, segment-based, software virtual memory scheme, which allowed it to run applications larger than available memory: code segments and resources were swapped in and thrown away when memory became scarce, and data segments moved in memory when a given application had relinquished processor control, typically waiting for user input.
16-bit Windows versions include Windows 1.0 (1985), Windows 2.0 (1987) and its close relatives, Windows/286-Windows/386.
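Cooperative multitasking, as mentioned above, relies on each application voluntarily handing control back so that others can run. The idea can be sketched with Python generators; this is an illustration of the general technique only, not of how Windows implemented it, and the application and scheduler names are hypothetical.

```python
from collections import deque

def application(name, steps, log):
    """A task that does one unit of work, then yields control voluntarily."""
    for step in range(1, steps + 1):
        log.append(f"{name} step {step}")
        yield  # hand control back to the scheduler

def run_cooperatively(tasks):
    """Round-robin scheduler: a task runs until it chooses to yield."""
    queue = deque(tasks)
    while queue:
        task = queue.popleft()
        try:
            next(task)
        except StopIteration:
            continue  # task finished, drop it
        queue.append(task)

log = []
run_cooperatively([application("A", 2, log), application("B", 3, log)])
print(log)
# ['A step 1', 'B step 1', 'A step 2', 'B step 2', 'B step 3']
```

A task that never reaches its yield would stall every other task, which is the main weakness of the cooperative approach compared with preemptive multitasking.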
Hybrid 16/32-bit operating environments
Windows/386 introduced a 32-bit protected mode kernel and virtual machine monitor.
For the duration of a Windows session, it created one or more virtual 8086 environments and provided device virtualization for the video card, keyboard, mouse, timer and interrupt controller inside each of them.
The user-visible consequence was that it became possible to preemptively multitask multiple MS-DOS environments in separate windows, although graphical MS-DOS applications required full screen mode.
Also, Windows applications were multi-tasked cooperatively inside one such virtual 8086 environment.
Windows 3.0 (1990) and Windows 3.1 (1992) improved the design, mostly because of virtual memory and loadable virtual device drivers (VxDs) which allowed them to share arbitrary devices between multitasked DOS windows.
Also, Windows applications could now run in protected mode (when Windows was running in Standard or 386 Enhanced Mode), which gave them access to several megabytes of memory and removed the obligation to participate in the software virtual memory scheme.
They still ran inside the same address space, where the segmented memory provided a degree of protection, and multi-tasked cooperatively.
For Windows 3.0, Microsoft also rewrote critical operations from C into assembly, making this release faster and less memory-hungry than its predecessors.
Hybrid 16/32-bit operating systems
With the introduction of the 32-bit Windows for Workgroups 3.11, Windows was able to stop relying on DOS for file management.
Leveraging this, Windows 95 introduced Long File Names, reducing the 8.3 filename DOS environment to the role of a boot loader.
MS-DOS was now bundled with Windows; this notably made it (partially) aware of long file names when its utilities were run from within Windows.
The most important novelty was the possibility of running 32-bit multi-threaded preemptively multitasked graphical programs.
However, the necessity of keeping compatibility with 16-bit programs meant the GUI components were still 16-bit only and not fully reentrant, which resulted in reduced performance and stability.
There were three releases of Windows 95 (the first in 1995, then subsequent bug-fix versions in 1996 and 1997, only released to OEMs, which added extra features such as FAT32 and primitive USB support).
Microsoft's next OS was Windows 98; there were two versions of this (the first in 1998 and the second, named "Windows 98 Second Edition", in 1999).
In 2000, Microsoft released Windows Me (Me standing for Millennium Edition), which used the same core as Windows 98 but adopted some aspects of Windows 2000 and removed the option to boot into DOS mode.
It also added a new feature called System Restore, allowing the user to set the computer's settings back to an earlier date.
32-bit operating systems
The NT family of Windows systems was fashioned and marketed for higher reliability business use, and was unencumbered by any Microsoft DOS patrimony.
The first release was Windows NT 3.1 (1993, numbered "3.1" to match the Windows version and to one-up OS/2 2.1, IBM's flagship OS, which had been co-developed by Microsoft and was Windows NT's main competitor at the time), which was followed by NT 3.5 (1994), NT 3.51 (1995), NT 4.0 (1996), and Windows 2000 (essentially NT 5.0).
NT 4.0 was the first in this line to implement the "Windows 95" user interface (and the first to include Windows 95's built-in 32-bit runtimes).
Microsoft then moved to combine their consumer and business operating systems.
Windows XP, coming in both home and professional versions (and later niche market versions for tablet PCs and media centers) improved stability, user experience and backwards compatibility.
Then, Windows Server 2003 brought Windows Server up to date with Windows XP.
Since then, Windows Vista has been released, and Windows Server 2008, released on February 27, 2008, brings Windows Server up to date with Windows Vista.
Windows CE, Microsoft's offering in the mobile and embedded markets, is also a true 32-bit operating system that offers various services for all sub-operating workstations.
64-bit operating systems
Windows NT included support for several different platforms before the x86-based personal computer became dominant in the professional world.
Versions of NT from 3.1 to 4.0 variously supported PowerPC, DEC Alpha and MIPS R4000, some of which were 64-bit processors, although the operating system treated them as 32-bit processors.
With the introduction of the Intel Itanium architecture, which is referred to as IA-64, Microsoft released new versions of Windows to support it.
Itanium versions of Windows XP and Windows Server 2003 were released at the same time as their mainstream x86 (32-bit) counterparts.
On April 25, 2005, Microsoft released Windows XP Professional x64 Edition and x64 versions of Windows Server 2003 to support the AMD64/Intel64 (or x64 in Microsoft terminology) architecture.
Microsoft dropped support for the Itanium version of Windows XP in 2005.
Windows Vista is the first end-user version of Windows that Microsoft has released simultaneously in 32-bit and x64 editions.
Windows Vista does not support the Itanium architecture.
The modern 64-bit Windows family comprises AMD64/Intel64 versions of Windows Vista, and Windows Server 2003 and Windows Server 2008, in both Itanium and x64 editions.
History
Microsoft has taken two parallel routes in its operating systems.
One route has been for the home user and the other has been for the professional IT user.
The dual routes have generally led to home versions having greater multimedia support and less functionality in networking and security, and professional versions having inferior multimedia support and better networking and security.
The first version of Microsoft Windows, version 1.0, released in November 1985, lacked a degree of functionality and achieved little popularity; it was intended to compete with Apple's own operating system.
Windows 1.0 is not a complete operating system; rather, it extends MS-DOS.
Microsoft Windows version 2.0 was released in November 1987 and was slightly more popular than its predecessor.
Windows 2.03 (released in January 1988) changed the OS from tiled windows to overlapping windows.
This change led Apple Computer to file a suit against Microsoft alleging infringement of Apple's copyrights.
Microsoft Windows version 3.0, released in 1990, was the first Microsoft Windows version to achieve broad commercial success, selling 2 million copies in the first six months (www.islandnet.com/~kpolsson/compsoft/soft1991.htm; www.thocp.net/companies/microsoft/microsoft_company.htm).
It featured improvements to the user interface and to multitasking capabilities.
It received a facelift in Windows 3.1, made generally available on March 1, 1992.
Windows 3.1 support ended on December 31, 2001.
In July 1993, Microsoft released Windows NT based on a new kernel.
NT was considered to be the professional OS and was the first Windows version to utilize preemptive multitasking.
Windows NT would later be retooled to also function as a home operating system, with Windows XP.
On August 24th 1995, Microsoft released Windows 95, a new, and major, consumer version that made further changes to the user interface, and also used preemptive multitasking.
Windows 95 was designed to replace not only Windows 3.1, but also Windows for Workgroups, and MS-DOS.
It was also the first Windows operating system to use Plug and Play capabilities.
The changes Windows 95 brought to the desktop were revolutionary, as opposed to the evolutionary changes in Windows 98 and Windows Me.
Mainstream support for Windows 95 ended on December 31, 2000 and extended support for Windows 95 ended on December 31, 2001.
The next in the consumer line was Microsoft Windows 98 released on June 25th, 1998.
It was substantially criticized for its slowness and for its unreliability compared with Windows 95, but many of its basic problems were later rectified with the release of Windows 98 Second Edition in 1999.
Mainstream support for Windows 98 ended on June 30, 2002 and extended support for Windows 98 ended on July 11, 2006.
As part of its "professional" line, Microsoft released Windows 2000 in February 2000.
The consumer version following Windows 98 was Windows Me (Windows Millennium Edition).
Released in September 2000, Windows Me implemented a number of new technologies for Microsoft; the most widely publicized was "Universal Plug and Play."
In October 2001, Microsoft released Windows XP, a version built on the Windows NT kernel that also retained the consumer-oriented usability of Windows 95 and its successors.
This new version was widely praised in computer magazines.
It shipped in two distinct editions, "Home" and "Professional", the former lacking many of the superior security and networking features of the Professional edition.
Additionally, the first "Media Center" edition was released in 2002, with an emphasis on support for DVD and TV functionality including program recording and a remote control.
Mainstream support for Windows XP will continue until April 14, 2009 and extended support will continue until April 8, 2014.
In April 2003, Windows Server 2003 was introduced, replacing the Windows 2000 line of server products with a number of new features and a strong focus on security; this was followed in December 2005 by Windows Server 2003 R2.
On January 30, 2007 Microsoft released Windows Vista.
It contains a number of new features, from a redesigned shell and user interface to significant technical changes, with a particular focus on security features.
It is available in a number of different editions, and has been subject to some criticism.
Security
Security has been a hot topic with Windows for many years, and even Microsoft itself has been the victim of security breaches.
Consumer versions of Windows were originally designed for ease-of-use on a single-user PC without a network connection, and did not have security features built in from the outset.
Windows NT and its successors are designed for security (including on a network) and multi-user PCs, but were not designed with Internet security in mind to the same degree, since Internet use was less prevalent when NT was first developed in the early 1990s.
These design issues, combined with flawed code (such as buffer overflows) and the popularity of Windows, mean that it is a frequent target of worm and virus writers.
In June 2005, Bruce Schneier's Counterpane Internet Security reported that it had seen over 1,000 new viruses and worms in the previous six months.
Microsoft releases security patches through its Windows Update service approximately once a month (usually the second Tuesday of the month), although critical updates are made available at shorter intervals when necessary.
In Windows 2000 (SP3 and later), Windows XP and Windows Server 2003, updates can be automatically downloaded and installed if the user chooses to do so.
As a result, Service Pack 2 for Windows XP, as well as Service Pack 1 for Windows Server 2003, were installed by users more quickly than they otherwise might have been.
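The usual monthly release day, the second Tuesday of the month, can be computed with a short calculation. The function below is a sketch of that date arithmetic only; it does not account for the out-of-schedule critical updates mentioned above.

```python
import datetime

def second_tuesday(year, month):
    """Return the date of the second Tuesday of the given month."""
    first = datetime.date(year, month, 1)
    # date.weekday() numbers Monday as 0, so Tuesday is 1.
    days_until_tuesday = (1 - first.weekday()) % 7
    return first + datetime.timedelta(days=days_until_tuesday + 7)

print(second_tuesday(2005, 6))  # 2005-06-14
print(second_tuesday(2008, 2))  # 2008-02-12
```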
Windows Defender
On 6 January 2005, Microsoft released a beta version of Microsoft AntiSpyware, based upon the previously released Giant AntiSpyware.
On 14 February, 2006, Microsoft AntiSpyware became Windows Defender with the release of beta 2.
Windows Defender is a freeware program designed to protect against spyware and other unwanted software.
Windows XP and Windows Server 2003 users who have genuine copies of Microsoft Windows can freely download the program from Microsoft's web site, and Windows Defender ships as part of Windows Vista.
Third-party analysis
In an article based on a report by Symantec, internetnews.com has described Microsoft Windows as having the "fewest number of patches and the shortest average patch development time of the five operating systems it monitored in the last six months of 2006."
The number of vulnerabilities found in Windows has also increased significantly (Windows: 12+, Red Hat + Fedora: 2, Mac OS X: 1, HP-UX: 2, Solaris: 1).
A study conducted by Kevin Mitnick and marketing communications firm Avantgarde in 2004 found that an unprotected and unpatched Windows XP system with Service Pack 1 lasted only 4 minutes on the Internet before it was compromised, and an unprotected and also unpatched Windows Server 2003 system was compromised after being connected to the internet for 8 hours.
However, it is important to note that this study does not apply to Windows XP systems running the Service Pack 2 update (released in late 2004), which vastly improved the security of Windows XP.
The computer that was running Windows XP Service Pack 2 was not compromised.
The AOL National Cyber Security Alliance Online Safety Study of October 2004 determined that 80% of Windows users were infected by at least one spyware/adware product.
Much documentation is available describing how to increase the security of Microsoft Windows products.
Typical suggestions include deploying Microsoft Windows behind a hardware or software firewall, running anti-virus and anti-spyware software, and installing patches as they become available through Windows Update.
Windows Lifecycle Policy
Microsoft has stopped releasing updates and hotfixes for many old Windows operating systems, including all versions of Windows 9x and earlier versions of Windows NT.
Windows versions prior to XP are no longer supported, with the exception of Windows 2000, which is currently in the Extended Support Period, which will end on July 13, 2010.
Windows XP versions prior to SP2 are no longer supported either.
Also, support for Windows XP 64-bit Edition ended after the release of the more recent Windows XP Professional x64 Edition.
No new updates are created for unsupported versions of Windows.
Emulation software
Emulation allows the use of some Windows applications without using Microsoft Windows.
These include:
Wine - a free and open source software implementation of the Windows API, allowing one to run many Windows applications on x86-based platforms, including Linux.
Wine is technically not an emulator but a "compatibility layer"; while an emulator effectively 'pretends' to be a different CPU, Wine instead makes use of Windows-style APIs to 'simulate' the Windows environment directly.
CrossOver - A Wine package with licensed fonts.
Its developers are regular contributors to Wine, and focus on Wine running officially supported applications.
Cedega - TransGaming Technologies' proprietary fork of Wine, designed specifically for running games written for Microsoft Windows under Linux.
Darwine - This project intends to port and develop Wine as well as other supporting tools that will allow Darwin and Mac OS X users to run Microsoft Windows applications, and to provide Win32 API compatibility at application source code level.
ReactOS - An open-source OS that is intended to run the same software as Windows, originally designed to imitate Windows NT 4.0, now aiming at Windows XP compatibility.
It has been in the development stage since 1996.
Morphology (linguistics)
Morphology is the field of linguistics that studies the internal structure of words.
(Words as units in the lexicon are the subject matter of lexicology.)
While words are generally accepted as being (with clitics) the smallest units of syntax, it is clear that in most (if not all) languages, words can be related to other words by rules.
For example, English speakers recognize that the words dog, dogs, and dog-catcher are closely related.
English speakers recognize these relations from their tacit knowledge of the rules of word-formation in English.
They intuit that dog is to dogs as cat is to cats; similarly, dog is to dog-catcher as dish is to dishwasher.
The rules understood by the speaker reflect specific patterns (or regularities) in the way words are formed from smaller units and how those smaller units interact in speech.
In this way, morphology is the branch of linguistics that studies patterns of word-formation within and across languages, and attempts to formulate rules that model the knowledge of the speakers of those languages.
History
The history of morphological analysis dates back to the ancient Indian linguist Pāṇini, who formulated the 3,959 rules of Sanskrit morphology in the text Aṣṭādhyāyī by using a constituency grammar.
The Graeco-Roman grammatical tradition also engaged in morphological analysis.
The term morphology was coined by August Schleicher in 1859.
Fundamental concepts
Lexemes and word forms
The distinction between these two senses of "word" is arguably the most important one in morphology.
The first sense of "word," the one in which dog and dogs are "the same word," is called lexeme.
The second sense is called word-form.
We thus say that dog and dogs are different forms of the same lexeme.
Dog and dog-catcher, on the other hand, are different lexemes; for example, they refer to two different kinds of entities.
The form of a word that is chosen conventionally to represent the canonical form of a word is called a lemma, or citation form.
Prosodic word vs. morphological word
Here are examples from other languages of the failure of a single phonological word to coincide with a single morphological word-form.
In Latin, one way to express the concept of 'NOUN-PHRASE1 and NOUN-PHRASE2' (as in "apples and oranges") is to suffix '-que' to the second noun phrase: "apples oranges-and", as it were.
An extreme level of this theoretical quandary posed by some phonological words is provided by the Kwak'wala language.
In Kwak'wala, as in a great many other languages, meaning relations between nouns, including possession and "semantic case", are formulated by affixes instead of by independent "words".
The three word English phrase, "with his club", where 'with' identifies its dependent noun phrase as an instrument and 'his' denotes a possession relation, would consist of two words or even just one word in many languages.
But affixation for semantic relations in Kwak'wala differs dramatically (from the viewpoint of those whose language is not Kwak'wala) from such affixation in other languages for this reason: the affixes phonologically attach not to the lexeme they pertain to semantically, but to the preceding lexeme.
Consider the following example (in Kwak'wala, sentences begin with what corresponds to an English verb):
kwixʔid-i-da bəgwanəmaᵢ-χ-a q'asa-s-isᵢ t'alwagwayu
Morpheme by morpheme translation:
kwixʔid-i-da = clubbed-PIVOT-DETERMINER
bəgwanəma-χ-a = man-ACCUSATIVE-DETERMINER
q'asa-s-is = otter-INSTRUMENTAL-3.PERSON.SINGULAR-POSSESSIVE
t'alwagwayu = club.
"the man clubbed the otter with his club"
(Notation notes:
1. accusative case marks an entity that something is done to.
2. determiners are words such as "the", "this", "that".
3. the concept of "pivot" is a theoretical construct that is not relevant to this discussion.)
That is, to the speaker of Kwak'wala, the sentence does not contain the "words" 'him-the-otter' or 'with-his-club'. Instead, the markers -i-da (PIVOT-'the'), referring to man, attach not to bəgwanəma ('man') but to the "verb"; the markers -χ-a (ACCUSATIVE-'the'), referring to otter, attach to bəgwanəma instead of to q'asa ('otter'), etc.
To summarize differently: a speaker of Kwak'wala does not perceive the sentence to consist of these phonological words:
kwixʔid i-da-bəgwanəmaᵢ χ-a-q'asa s-isᵢ-t'alwagwayu
"clubbed PIVOT-the-manᵢ hit-the-otter with-hisᵢ-club"
A central publication on this topic is the recent volume edited by Dixon and Aikhenvald (2007), examining the mismatch between prosodic-phonological and grammatical definitions of "word" in various Amazonian, Australian Aboriginal, Caucasian, Eskimo, Indo-European, Native North American, and West African languages, and in sign languages.
Apparently, a wide variety of languages make use of the hybrid linguistic unit clitic, possessing the grammatical features of independent words but the prosodic-phonological lack of freedom of bound morphemes.
The intermediate status of clitics poses a considerable challenge to linguistic theory.
Inflection vs. word-formation
Given the notion of a lexeme, it is possible to distinguish two kinds of morphological rules.
Some morphological rules relate different forms of the same lexeme, while others relate different lexemes.
Rules of the first kind are called inflectional rules, while those of the second kind are called word-formation rules.
The English plural, as illustrated by dog and dogs, is an inflectional rule; compounds like dog-catcher or dishwasher provide an example of a word-formation rule.
Informally, word-formation rules form "new words" (that is, new lexemes), while inflection rules yield variant forms of the "same" word (lexeme).
There is a further distinction between two kinds of word-formation: derivation and compounding.
Compounding is a process of word-formation that involves combining complete word-forms into a single compound form; dog-catcher is therefore a compound, because both dog and catcher are complete word-forms in their own right before the compounding process has been applied, and are subsequently treated as one form.
Derivation involves affixing bound (non-independent) forms to existing lexemes, whereby the addition of the affix derives a new lexeme.
One example of derivation is clear in this case: the word independent is derived from the word dependent by prefixing it with the derivational prefix in-, while dependent itself is derived from the verb depend.
The distinction between inflection and word-formation is not at all clear-cut.
There are many examples where linguists fail to agree whether a given rule is inflection or word-formation.
The next section will attempt to clarify this distinction.
Paradigms and morphosyntax
A paradigm is the complete set of related word-forms associated with a given lexeme.
The familiar examples of paradigms are the conjugations of verbs, and the declensions of nouns.
Accordingly, the word-forms of a lexeme may be arranged conveniently into tables, by classifying them according to shared inflectional categories such as tense, aspect, mood, number, gender or case.
For example, the personal pronouns in English can be organized into tables, using the categories of person (1st., 2nd., 3rd.), number (singular vs. plural), gender (masculine, feminine, neuter), and case (subjective, objective, and possessive).
See English personal pronouns for the details.
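As a minimal sketch of how such a table might be represented, the following Python fragment stores the English personal pronouns under the categories named above. The names PRONOUNS and lookup, and the choice of a nested dictionary, are illustrative assumptions rather than a standard representation.

```python
# Illustrative paradigm table for the English personal pronouns.
# Keys are (person, number, gender); gender is None where English
# does not mark it. Each cell maps a case to a word-form.
PRONOUNS = {
    (1, "singular", None): {"subjective": "I", "objective": "me", "possessive": "my"},
    (2, "singular", None): {"subjective": "you", "objective": "you", "possessive": "your"},
    (3, "singular", "masculine"): {"subjective": "he", "objective": "him", "possessive": "his"},
    (3, "singular", "feminine"): {"subjective": "she", "objective": "her", "possessive": "her"},
    (3, "singular", "neuter"): {"subjective": "it", "objective": "it", "possessive": "its"},
    (1, "plural", None): {"subjective": "we", "objective": "us", "possessive": "our"},
    (2, "plural", None): {"subjective": "you", "objective": "you", "possessive": "your"},
    (3, "plural", None): {"subjective": "they", "objective": "them", "possessive": "their"},
}


def lookup(person, number, case, gender=None):
    """Return the word-form that fills one cell of the paradigm."""
    return PRONOUNS[(person, number, gender)][case]


print(lookup(3, "singular", "objective", gender="masculine"))
print(lookup(1, "plural", "subjective"))
```

Each key picks out one row of the paradigm and each case picks out one column, so every word-form of the lexeme is identified by a combination of inflectional categories.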
The inflectional categories used to group word-forms into paradigms cannot be chosen arbitrarily; they must be categories that are relevant to stating the syntactic rules of the language.
For example, person and number are categories that can be used to define paradigms in English, because English has grammatical agreement rules that require the verb in a sentence to appear in an inflectional form that matches the person and number of the subject.
In other words, the syntactic rules of English care about the difference between dog and dogs, because the choice between these two forms determines which form of the verb is to be used.
In contrast, however, no syntactic rule of English cares about the difference between dog and dog-catcher, or dependent and independent.
The first two are just nouns, and the second two just adjectives, and they generally behave like any other noun or adjective behaves.
An important difference between inflection and word-formation is that inflected word-forms of lexemes are organized into paradigms, which are defined by the requirements of syntactic rules, whereas the rules of word-formation are not restricted by any corresponding requirements of syntax.
Inflection is therefore said to be relevant to syntax, and word-formation is not.
The part of morphology that covers the relationship between syntax and morphology is called morphosyntax, and it concerns itself with inflection and paradigms, but not with word-formation or compounding.
Allomorphy
In the exposition above, morphological rules are described as analogies between word-forms: dog is to dogs as cat is to cats, and as dish is to dishes.
In this case, the analogy applies both to the form of the words and to their meaning: in each pair, the first word means "one of X", while the second "two or more of X", and the difference is always the plural form -s affixed to the second word, signaling the key distinction between singular and plural entities.
One of the largest sources of complexity in morphology is that this one-to-one correspondence between meaning and form does not hold in every case in the language.
In English, we have word form pairs like ox/oxen, goose/geese, and sheep/sheep, where the difference between the singular and the plural is signaled in a way that departs from the regular pattern, or is not signaled at all.
Even cases considered "regular", with the final -s, are not so simple; the -s in dogs is not pronounced the same way as the -s in cats, and in a plural like dishes, an "extra" vowel appears before the -s.
These cases, where the same distinction is effected by alternative forms of a "word", are called allomorphy.
Phonological rules constrain which sounds can appear next to each other in a language, and morphological rules, when applied blindly, would often violate phonological rules, by resulting in sound sequences that are prohibited in the language in question.
For example, to form the plural of dish by simply appending an -s to the end of the word would result in the form *[dɪʃs], which is not permitted by the phonotactics of English.
In order to "rescue" the word, a vowel sound is inserted between the root and the plural marker, and [dɪʃɪz] results.
Similar rules apply to the pronunciation of the -s in dogs and cats: it depends on the quality (voiced vs. unvoiced) of the final preceding phoneme.
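The choice among the three pronunciations of the regular plural can be sketched in Python as follows. This is a simplified illustration: the phoneme sets SIBILANTS and VOICELESS and the function plural_allomorph are assumptions made for this example, and the input is the final phoneme of the stem written in IPA, not the spelling.

```python
# Simplified phoneme classes for English (assumed, not exhaustive).
SIBILANTS = {"s", "z", "ʃ", "ʒ", "tʃ", "dʒ"}
VOICELESS = {"p", "t", "k", "f", "θ"}


def plural_allomorph(final_phoneme):
    """Pick the pronunciation of the regular plural -s for a stem
    ending in the given phoneme."""
    if final_phoneme in SIBILANTS:
        # A vowel is inserted to avoid a prohibited sibilant cluster.
        return "ɪz"
    if final_phoneme in VOICELESS:
        # After a voiceless sound the marker stays voiceless.
        return "s"
    # After any other voiced sound (including vowels) it is voiced.
    return "z"


for word, final in [("dog", "g"), ("cat", "t"), ("dish", "ʃ")]:
    print(word, "->", plural_allomorph(final))
```

The sibilant check comes first because sibilants such as [s] would otherwise fall through to the voiced or voiceless branches.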
Lexical morphology
Lexical morphology is the branch of morphology that deals with the lexicon, which, morphologically conceived, is the collection of lexemes in a language.
As such, it concerns itself primarily with word-formation: derivation and compounding.
Models of morphology
There are three principal approaches to morphology, which each try to capture the distinctions above in different ways.
These are,
Morpheme-based morphology, which makes use of an Item-and-Arrangement approach.
Lexeme-based morphology, which normally makes use of an Item-and-Process approach.
Word-based morphology, which normally makes use of a Word-and-Paradigm approach.
Note that while the associations indicated between the concepts in each item in that list are very strong, they are not absolute.
Morpheme-based morphology
In morpheme-based morphology, word-forms are analyzed as arrangements of morphemes.
A morpheme is defined as the minimal meaningful unit of a language.
In a word like independently, we say that the morphemes are in-, depend, -ent, and -ly; depend is the root and the other morphemes are, in this case, derivational affixes.
In a word like dogs, we say that dog is the root, and that -s is an inflectional morpheme.
This way of analyzing word-forms, as if they were made of morphemes put one after another like beads on a string, is called Item-and-Arrangement.
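The "beads on a string" picture can be sketched in Python by treating a word-form as an ordered list of morphemes. The Morpheme type and the helper functions below are hypothetical names chosen for this illustration; the segmentations are the ones given in the text.

```python
from typing import NamedTuple


class Morpheme(NamedTuple):
    form: str  # written with a hyphen on the side where it attaches
    kind: str  # "root", "derivational" or "inflectional"


# Word-forms as arrangements of morphemes.
INDEPENDENTLY = [
    Morpheme("in-", "derivational"),
    Morpheme("depend", "root"),
    Morpheme("-ent", "derivational"),
    Morpheme("-ly", "derivational"),
]
DOGS = [Morpheme("dog", "root"), Morpheme("-s", "inflectional")]


def spell(morphemes):
    """Concatenate the morphemes in order, dropping the hyphens."""
    return "".join(m.form.strip("-") for m in morphemes)


def root(morphemes):
    """Return the form of the root morpheme."""
    return next(m.form for m in morphemes if m.kind == "root")


print(spell(INDEPENDENTLY), root(INDEPENDENTLY))
print(spell(DOGS), root(DOGS))
```

Under this view the only operation is concatenation, which is why forms such as geese or sheep, where no separate plural "bead" can be cut off, are awkward for a purely Item-and-Arrangement analysis.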
The morpheme-based approach is the first one that beginners to morphology usually think of, and which laymen tend to find the most obvious.
This is so to such an extent that very often beginners think that morphemes are an inevitable, fundamental notion of morphology, and many five-minute explanations of morphology are, in fact, five-minute explanations of morpheme-based morphology.
This is, however, not so.
The fundamental idea of morphology is that the words of a language are related to each other by different kinds of rules.
Analyzing words as sequences of morphemes is a way of describing these relations, but is not the only way.
In actual academic linguistics, morpheme-based morphology certainly has many adherents, but is by no means the dominant approach.
Lexeme-based morphology
Lexeme-based morphology is (usually) an Item-and-Process approach.
Instead of analyzing a word-form as a set of morphemes arranged in sequence, a word-form is said to be the result of applying rules that alter a word-form or stem in order to produce a new one.
An inflectional rule takes a stem, changes it as is required by the rule, and outputs a word-form; a derivational rule takes a stem, changes it as per its own requirements, and outputs a derived stem; a compounding rule takes word-forms, and similarly outputs a compound stem.
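A minimal Python sketch of this view treats each rule as a function from stems to stems or word-forms. The function names and the spelling-based conditions are simplifying assumptions for illustration; they work on English orthography and cover only the regular patterns.

```python
def pluralize(stem):
    """Inflectional rule: takes a noun stem, outputs a plural word-form."""
    if stem.endswith(("s", "sh", "ch", "x", "z")):
        return stem + "es"
    return stem + "s"


def agentive(stem):
    """Derivational rule: takes a verb stem, outputs a derived noun stem."""
    if stem.endswith("e"):
        return stem + "r"
    return stem + "er"


def compound(first, second):
    """Compounding rule: takes two word-forms, outputs a compound stem."""
    return first + "-" + second


# Rules compose: derive 'catcher', compound it with 'dog', then inflect.
print(pluralize(compound("dog", agentive("catch"))))
print(pluralize("dish"))
```

Because the rules are processes and not pieces, a rule is free to change its input in ways other than appending material, which is what distinguishes this approach from Item-and-Arrangement.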
Word-based morphology
Word-based morphology is (usually) a Word-and-Paradigm approach.
This theory takes paradigms as a central notion.
Instead of stating rules to combine morphemes into word-forms, or to generate word-forms from stems, word-based morphology states generalizations that hold between the forms of inflectional paradigms.
The major point behind this approach is that many such generalizations are hard to state with either of the other approaches.
The examples are usually drawn from fusional languages, where a given "piece" of a word, which a morpheme-based theory would call an inflectional morpheme, corresponds to a combination of grammatical categories, for example, "third person plural."
Morpheme-based theories usually have no problems with this situation, since one just says that a given morpheme has two categories.
Item-and-Process theories, on the other hand, often break down in cases like these, because they all too often assume that there will be two separate rules here, one for third person, and the other for plural, but the distinction between them turns out to be artificial.
Word-and-Paradigm approaches treat these as whole words that are related to each other by analogical rules.
Words can be categorized based on the pattern they fit into.
This applies both to existing words and to new ones.
Application of a pattern different from the one that has been used historically can give rise to a new word, such as older replacing elder (where older follows the normal pattern of adjectival comparatives) and cows replacing kine (where cows fits the regular pattern of plural formation).
While a Word-and-Paradigm approach can explain this easily, other approaches have difficulty with phenomena such as this.
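The analogical reasoning involved can be sketched as solving a proportion between whole words, "base is to inflected as new_base is to X". The function analogize below is a hypothetical illustration that handles only changes at the end of a word; it is not a model taken from the literature.

```python
def analogize(base, inflected, new_base):
    """Solve base : inflected = new_base : X for X by carrying over
    the change observed at the end of the exemplar pair."""
    # Find the longest common prefix of the exemplar pair.
    i = 0
    while i < min(len(base), len(inflected)) and base[i] == inflected[i]:
        i += 1
    old_ending, new_ending = base[i:], inflected[i:]
    if not new_base.endswith(old_ending):
        raise ValueError("the new word does not fit this pattern")
    return new_base[: len(new_base) - len(old_ending)] + new_ending


# dog : dogs = cow : X yields the regular plural that replaced 'kine'.
print(analogize("dog", "dogs", "cow"))
# cold : colder = old : X yields the regular form that replaced 'elder'.
print(analogize("cold", "colder", "old"))
```

No morphemes or stems are stipulated here: the pattern is read off a pair of existing whole words and extended to a new one.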
Morphological typology
In the 19th century, philologists devised a now classic classification of languages according to their morphology.
According to this typology, some languages are isolating, and have little to no morphology; others are agglutinative, and their words tend to have lots of easily-separable morphemes; while others yet are inflectional or fusional, because their inflectional morphemes are said to be "fused" together.
This leads to one bound morpheme conveying multiple pieces of information.
The classic example of an isolating language is Chinese; the classic example of an agglutinative language is Turkish; both Latin and Greek are classic examples of fusional languages.
Considering the variability of the world's languages, it becomes clear that this classification is not at all clear-cut, and many languages do not neatly fit any one of these types, and some fit in more than one.
A continuum of morphological complexity may be adopted when classifying languages.
The three models of morphology stem from attempts to analyze languages that more or less match different categories in this typology.
The Item-and-Arrangement approach fits very naturally with agglutinative languages; while the Item-and-Process and Word-and-Paradigm approaches usually address fusional languages.
The reader should also note that the classical typology applies mostly to inflectional morphology.
There is very little fusion going on with word-formation.
Languages may be classified as synthetic or analytic in their word formation, depending on the preferred way of expressing notions that are not inflectional: either by using word-formation (synthetic), or by using syntactic phrases (analytic).
Named entity recognition
Named entity recognition (NER) (also known as entity identification (EI) and entity extraction) is a subtask of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.
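For example, a NER system producing MUC-style output might tag the sentence
Jim bought 300 shares of Acme Corp. in 2006.
as
<ENAMEX TYPE="PERSON">Jim</ENAMEX> bought <NUMEX TYPE="QUANTITY">300</NUMEX> shares of <ENAMEX TYPE="ORGANIZATION">Acme Corp.</ENAMEX> in <TIMEX TYPE="DATE">2006</TIMEX>.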
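A toy rule-based tagger that produces MUC-style markup for this sentence can be sketched in Python as follows. The rule list is a hypothetical stand-in for a real grammar or statistical model: the patterns were written by hand for this one sentence, and the names RULES and tag are assumptions of the example.

```python
import re

# (entity type, MUC element, pattern). Earlier rules take priority,
# so a four-digit year is tried as a DATE before a bare QUANTITY.
RULES = [
    ("PERSON", "ENAMEX", r"\bJim\b"),
    ("ORGANIZATION", "ENAMEX", r"\bAcme Corp\."),
    ("DATE", "TIMEX", r"\b(?:19|20)\d{2}\b"),
    ("QUANTITY", "NUMEX", r"\b\d+\b"),
]
ELEMENT = {label: element for label, element, _ in RULES}
# One combined pattern with a named group per entity type.
PATTERN = re.compile("|".join(f"(?P<{label}>{rx})" for label, _, rx in RULES))


def tag(text):
    """Wrap every recognized entity in a MUC-style element."""

    def wrap(match):
        label = match.lastgroup  # name of the group that matched
        element = ELEMENT[label]
        return f'<{element} TYPE="{label}">{match.group(0)}</{element}>'

    return PATTERN.sub(wrap, text)


print(tag("Jim bought 300 shares of Acme Corp. in 2006."))
```

Scanning the text once with a combined pattern avoids re-matching inside markup that has already been inserted. The rules are deliberately naive: they would, for instance, label any number between 1900 and 2099 as a date, which is the kind of ambiguity practical systems have to resolve from context.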
NER systems have been created that use linguistic grammar-based techniques as well as statistical models.
Hand-crafted grammar-based systems typically obtain better results, but at the cost of months of work by experienced linguists.
Statistical NER systems typically require a large amount of manually annotated training data.
Since about 1998, there has been a great deal of interest in entity identification in the molecular biology, bioinformatics, and medical natural language processing communities.
The most common entity of interest in that domain has been names of genes and gene products.
Named entity types
In the expression named entity, the word named restricts the task to those entities for which one or many rigid designators, as defined by Kripke, stand for the referent.
For instance, the automotive company created by Henry Ford in 1903 is referred to as Ford or Ford Motor Company.
Rigid designators include proper names as well as certain natural kind terms like biological species and substances.
There is a general agreement to include temporal expressions and some numerical expressions such as money and measures in named entities.
While some instances of these types are good examples of rigid designators (e.g., the year 2001) there are also many invalid ones (e.g., I take my vacations in “June”).
In the first case, the year 2001 refers to the 2001st year of the Gregorian calendar.
In the second case, the month June may refer to the month of an undefined year (past June, next June, June 2020, etc.).
It is arguable that the named entity definition is loosened in such cases for practical reasons.
At least two hierarchies of named entity types have been proposed in the literature.
The BBN categories www.ldc.upenn.edu/Catalog/docs/LDC2005T33/BBN-Types-Subtypes.html, proposed in 2002, are used for Question Answering and consist of 29 types and 64 subtypes.
Sekine's extended hierarchy nlp.cs.nyu.edu/ene/, proposed in 2002, is made up of 200 subtypes.
Evaluation
Benchmarking and evaluations have been performed in the Message Understanding Conferences (MUC) organized by DARPA, International Conference on Language Resources and Evaluation (LREC), Computational Natural Language Learning (CoNLL) workshops, Automatic Content Extraction (ACE) organized by NIST, the Multilingual Entity Task Conference (MET), Information Retrieval and Extraction Exercise (IREX) and in HAREM (Portuguese language only).
State-of-the-art systems produce near-human performance.
For instance, the best system entering MUC-7 achieved an F-measure of 93.39%, while human annotators scored 97.60% and 96.95%.
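The F-measure combines precision (the share of predicted entities that are correct) and recall (the share of true entities that were found) into a single score. The sketch below computes it in Python under the simplifying assumption that an entity counts as correct only when both its text and its type match exactly; the actual MUC scoring scheme has its own rules, and the function name and sample data here are illustrative.

```python
def f_measure(gold, predicted, beta=1.0):
    """Weighted harmonic mean of precision and recall over two sets
    of (text, type) entity annotations."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(gold)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


gold = {
    ("Jim", "PERSON"),
    ("300", "QUANTITY"),
    ("Acme Corp.", "ORGANIZATION"),
    ("2006", "DATE"),
}
# A hypothetical system output: one entity missed, one mistyped.
predicted = {
    ("Jim", "PERSON"),
    ("Acme Corp.", "ORGANIZATION"),
    ("2006", "QUANTITY"),
}
print(round(f_measure(gold, predicted), 4))
```

Here two of the three predictions are correct (precision 2/3) and two of the four gold entities are found (recall 1/2), giving an F-measure of 4/7.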
Natural language
In the philosophy of language, a natural language (or ordinary language) is a language that is spoken, written, or signed by humans for general-purpose communication, as distinguished from formal languages (such as computer-programming languages or the "languages" used in the study of formal logic, especially mathematical logic) and from constructed languages.
Defining natural language
Though the exact definition is debatable, natural language is often contrasted with artificial or constructed languages such as Esperanto, Latino Sine Flexione, and Occidental.
Linguists have an incomplete understanding of all aspects of the rules underlying natural languages, and these rules are therefore objects of study.
The understanding of natural languages reveals much about not only how language works (in terms of syntax, semantics, phonetics, phonology, etc.), but also about how the human mind and the human brain process language.
In linguistic terms, 'natural language' only applies to a language that has evolved naturally, and the study of natural language primarily involves native (first language) speakers.
The theory of universal grammar proposes that all natural languages have certain underlying rules which constrain the structure of the specific grammar for any given language.
While grammarians, writers of dictionaries, and language policy-makers all have a certain influence on the evolution of language, their ability to influence what people think they 'ought' to say is distinct from what people actually say.
Natural language applies to the latter, and is thus a 'descriptive' rather than a 'prescriptive' term.
Thus non-standard language varieties (such as African American Vernacular English) are considered to be natural while standard language varieties (such as Standard American English) which are more 'prescripted' can be considered to be at least somewhat artificial or constructed.
Native language learning
The learning of one's own native language, typically that of one's parents, normally occurs spontaneously in early human childhood and is biologically driven.
A crucial role in this process is played by the neural activity of a portion of the human brain known as Broca's area.
There are approximately 7,000 current human languages, and many, if not most, seem to share certain properties, leading to the belief in the existence of Universal Grammar, as shown by generative grammar studies pioneered by the work of Noam Chomsky.
Recently, it has been demonstrated that a dedicated network in the human brain (crucially involving Broca's area, a portion of the left inferior frontal gyrus), is selectively activated by complex verbal structures (but not simple ones) of those languages that meet the Universal Grammar requirements.
Origins of natural language
There is disagreement among anthropologists on when language was first used by humans (or their ancestors).
Estimates range from about two million (2,000,000) years ago, during the time of Homo habilis, to as recently as forty thousand (40,000) years ago, during the time of Cro-Magnon man.
However, recent evidence suggests modern human language was invented or evolved in Africa prior to the dispersal of humans from Africa around 50,000 years ago.
Since all people, including the most isolated indigenous groups such as the Andamanese or the Tasmanian aboriginals, possess language, it must have been present in the ancestral populations in Africa before the human population split into various groups to colonize the rest of the world.
Some claim that all natural languages came out of one single language, known as Adamic.
Linguistic diversity
As of early 2007, there are 6,912 known living human languages.
A "living language" is simply one which is in wide use by a specific group of living people.
The exact number of known living languages varies from 5,000 to 10,000, depending generally on the precision of one's definition of "language", and in particular on how one classifies dialects.
There are also many dead or extinct languages.
There is no clear distinction between a language and a dialect, notwithstanding linguist Max Weinreich's famous aphorism that "a language is a dialect with an army and navy."
In other words, the distinction may hinge on political considerations as much as on cultural differences, distinctive writing systems, or degree of mutual intelligibility.
It is probably impossible to accurately enumerate the living languages because our worldwide knowledge is incomplete, and it is a "moving target", as explained in greater detail by the Ethnologue's Introduction, p. 7 - 8.
With the 15th edition, the 103 newly added languages are not new but reclassified due to refinements in the definition of language.
Although widely considered an encyclopedia, the Ethnologue actually presents itself as an incomplete catalog, including only named languages that its editors are able to document.
With each edition, the number of catalogued languages has grown.
Beginning with the 14th edition (2000), an attempt was made to include all known living languages.
SIL used an internal 3-letter code fashioned after airport codes to identify languages.
This was the precursor to the modern ISO 639-3 standard, to which SIL contributed.
The standard allows for over 14,000 languages.
In turn, the 15th edition was revised to conform to the pending ISO 639-3 standard.
Of the catalogued languages, 497 have been flagged as "nearly extinct" due to trends in their usage.
Per the 15th edition, 6,912 living languages are shared by over 5.7 billion speakers. (p. 15)
Taxonomy
The classification of natural languages can be performed on the basis of different underlying principles (different closeness notions, respecting different properties and relations between languages); important directions of present classifications are:
paying attention to the historical evolution of languages results in a genetic classification of languages—which is based on genetic relatedness of languages,
paying attention to the internal structure of languages (grammar) results in a typological classification of languages—which is based on similarity of one or more components of the language's grammar across languages,
and respecting geographical closeness and contacts between language-speaking communities results in areal groupings of languages.
The different classifications do not match each other and are not expected to, but the correlation between them is an important point for many linguistic research works.
(There is a parallel to the classification of species in biological phylogenetics here: consider monophyletic vs. polyphyletic groups of species.)
The task of genetic classification belongs to the field of historical-comparative linguistics; that of typological classification belongs to linguistic typology.
See also Taxonomy, and Taxonomic classification for the general idea of classification and taxonomies.
Genetic classification
The world's languages have been grouped into families of languages that are believed to have common ancestors.
Some of the major families are the Indo-European languages, the Afro-Asiatic languages, the Austronesian languages, and the Sino-Tibetan languages.
The shared features of languages from one family can be due to shared ancestry.
(Compare with homology in biology.)
Typological classification
An example of a typological classification is the classification of languages, on the basis of the basic order of the verb, the subject and the object in a sentence, into several types: SVO languages, SOV languages, VSO languages, and so on.
(English, for instance, belongs to the SVO language type.)
The shared features of languages of one type (= from one typological class) may have arisen completely independently.
(Compare with analogy in biology.)
Their co-occurrence might be due to the universal laws governing the structure of natural languages—language universals.
Areal classification
The following language groupings can serve as some linguistically significant examples of areal linguistic units, or sprachbunds: Balkan linguistic union, or the bigger group of European languages; Caucasian languages; East Asian languages.
Although the members of each group are not closely genetically related, there is a reason for them to share similar features: their speakers have been in contact for a long time within a common community, and the languages have converged in the course of history.
These are called "areal features".
One should be careful about the underlying classification principle for groups of languages which have apparently a geographical name: besides areal linguistic units, the taxa of the genetic classification (language families) are often given names which themselves or parts of which refer to geographical areas.
Controlled languages
Controlled natural languages are subsets of natural languages whose grammars and dictionaries have been restricted in order to reduce or eliminate both ambiguity and complexity.
The purpose behind the development and implementation of a controlled natural language typically is to aid non-native speakers of a natural language in understanding it, or to ease computer processing of a natural language.
An example of a widely used controlled natural language is Simplified English, which was originally developed for aerospace industry maintenance manuals.
Constructed languages and international auxiliary languages
Constructed international auxiliary languages such as Esperanto and Interlingua that have native speakers are by some also considered natural languages.
However, constructed languages, while they are clearly languages, are not generally considered natural languages.
The problem is that other languages have been used to communicate and have evolved in a natural way, while Esperanto was selectively designed by L.L. Zamenhof from natural languages, not grown from the natural fluctuations in vocabulary and syntax.
Nor has Esperanto been naturally "standardized" by children's natural tendency to correct for illogical grammar structures in their parents' language, which can be seen in the development of pidgin languages into creole languages (as explained by Steven Pinker in The Language Instinct).
A possible exception to this is the case of true native speakers of such languages.
A more substantive basis for this designation is that the vocabulary, grammar, and orthography of Interlingua are natural; they have been standardized and presented by a linguistic research body, but they predated it and are not themselves considered a product of human invention.
Most experts, however, consider Interlingua to be naturalistic rather than natural.
Latino Sine Flexione, a second naturalistic auxiliary language, is also naturalistic in content but is no longer widely spoken.
Natural Language Processing
Natural language processing (NLP) is a subfield of artificial intelligence and computational linguistics.
It studies the problems of automated generation and understanding of natural human languages.
Natural-language-generation systems convert information from computer databases into normal-sounding human language.
Natural-language-understanding systems convert samples of human language into more formal representations that are easier for computer programs to manipulate.
Modalities
Natural language manifests itself in modalities other than speech.
Sign languages
In linguistic terms, sign languages are as rich and complex as any oral language, despite the previously common misconception that they are not "real languages".
Professional linguists have studied many sign languages and found them to have every linguistic component required to be classed as true natural languages.
Sign languages are not pantomime, much as most spoken language is not onomatopoeic.
The signs do tend to exploit iconicity (visual connections with their referents) more than what is common in spoken language, but they are above all conventional and hence generally incomprehensible to non-speakers, just like spoken words and morphemes.
They are not a visual rendition of an oral language either.
They have complex grammars of their own, and can be used to discuss any topic, from the simple and concrete to the lofty and abstract.
Written languages
In a sense, written language should be distinguished from natural language.
Until recently in the developed world, it was common for many people to be fluent in spoken or signed languages and yet remain illiterate; this is still the case in poor countries today.
Furthermore, natural language acquisition during childhood is largely spontaneous, while literacy must usually be intentionally acquired.
Natural language processing
Natural language processing (NLP) is a subfield of artificial intelligence and computational linguistics.
It studies the problems of automated generation and understanding of natural human languages.
Natural-language-generation systems convert information from computer databases into normal-sounding human language.
Natural-language-understanding systems convert samples of human language into more formal representations that are easier for computer programs to manipulate.
Tasks and limitations
In theory, natural-language processing is a very attractive method of human-computer interaction.
Early systems such as SHRDLU, working in restricted "blocks worlds" with restricted vocabularies, worked extremely well, leading researchers to excessive optimism, which was soon lost when the systems were extended to more realistic situations with real-world ambiguity and complexity.
Natural-language understanding is sometimes referred to as an AI-complete problem, because natural-language recognition seems to require extensive knowledge about the outside world and the ability to manipulate it.
The definition of "understanding" is one of the major problems in natural-language processing.
Concrete problems
Some examples of the problems faced by natural-language-understanding systems:
The sentences We gave the monkeys the bananas because they were hungry and We gave the monkeys the bananas because they were over-ripe have the same surface grammatical structure.
However, the pronoun they refers to monkeys in one sentence and bananas in the other, and it is impossible to tell which without a knowledge of the properties of monkeys and bananas.
A string of words may be interpreted in different ways.
For example, the string Time flies like an arrow may be interpreted in a variety of ways:
the common simile: time moves quickly just like an arrow does;
measure the speed of flies like you would measure that of an arrow (thus interpreted as an imperative), i.e. "(You should) time flies as you would (time) an arrow";
measure the speed of flies like an arrow would, i.e. "Time flies in the same way that an arrow would (time them)";
measure the speed of flies that are like arrows, i.e. "Time those flies that are like arrows";
all of a type of flying insect, "time-flies", collectively enjoys a single arrow (compare "Fruit flies like a banana");
each of a type of flying insect, "time-flies", individually enjoys a different arrow (similar comparison applies);
a concrete object, for example the magazine Time, travels through the air in an arrow-like manner.
English is particularly challenging in this regard because it has little inflectional morphology to distinguish between parts of speech.
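The way a single string admits several analyses can be made concrete with a small chart parser. The sketch below is only an illustration under stated assumptions: the rules in BINARY_RULES, the categories in LEXICON and the function count_parses are hypothetical, and the toy grammar is deliberately restricted so that it covers just two of the readings listed above (the simile, and the reading in which "time flies" are insects that like an arrow). It counts parse trees with the CKY algorithm.

```python
from collections import defaultdict

# Hypothetical toy grammar in Chomsky normal form: (parent, left, right).
BINARY_RULES = [
    ("S", "NP", "VP"),
    ("NP", "Det", "N"),
    ("NP", "N", "N"),    # compound noun, e.g. "time flies"
    ("VP", "V", "NP"),
    ("VP", "V", "PP"),
    ("PP", "P", "NP"),
]

# Hypothetical lexicon: each word may belong to several categories.
LEXICON = {
    "time": {"N", "NP"},
    "flies": {"N", "NP", "V"},
    "like": {"P", "V"},
    "an": {"Det"},
    "arrow": {"N"},
}


def count_parses(words, root="S"):
    """Count the distinct parse trees for words under the toy grammar (CKY)."""
    n = len(words)
    # chart[(i, j)][symbol] = number of trees deriving words[i:j] from symbol
    chart = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(words):
        for symbol in LEXICON[word]:
            chart[(i, i + 1)][symbol] = 1
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for parent, left, right in BINARY_RULES:
                    chart[(i, j)][parent] += chart[(i, k)][left] * chart[(k, j)][right]
    return chart[(0, n)][root]


parse_count = count_parses("time flies like an arrow".split())
```

Even this tiny grammar yields more than one tree for the sentence; a grammar with realistic coverage would license many more, which is why semantic and contextual knowledge is needed to choose among them.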
English and several other languages don't specify which word an adjective applies to.
For example, consider the string "pretty little girls' school":
Does the school look little?
Do the girls look little?
Do the girls look pretty?
Does the school look pretty?
We will often imply additional information in spoken language by the way we place stress on words.
The sentence "I never said she stole my money" demonstrates the importance stress can play in a sentence, and thus the inherent difficulty a natural language processor can have in parsing it.
Depending on which word the speaker stresses, this sentence could have several distinct meanings:
"I never said she stole my money" (stress on "I") - Someone else said it, but I didn't.
"I never said she stole my money" (stress on "never") - I simply didn't ever say it.
"I never said she stole my money" (stress on "said") - I might have implied it in some way, but I never explicitly said it.
"I never said she stole my money" (stress on "she") - I said someone took it; I didn't say it was she.
"I never said she stole my money" (stress on "stole") - I just said she probably borrowed it.
"I never said she stole my money" (stress on "my") - I said she stole someone else's money.
"I never said she stole my money" (stress on "money") - I said she stole something, but not my money.
Subproblems
Speech segmentation:
In most spoken languages, the sounds representing successive letters blend into each other, so the conversion of the analog signal to discrete characters can be a very difficult process.
Also, in natural speech there are hardly any pauses between successive words; the location of those boundaries usually must take into account grammatical and semantic constraints, as well as the context.
Text segmentation:
Some written languages like Chinese, Japanese and Thai do not have single-word boundaries either, so any significant text parsing usually requires the identification of word boundaries, which is often a non-trivial task.
Word sense disambiguation:
Many words have more than one meaning; we have to select the meaning which makes the most sense in context.
Syntactic ambiguity:
The grammar for natural languages is ambiguous, i.e. there are often multiple possible parse trees for a given sentence.
Choosing the most appropriate one usually requires semantic and contextual information.
Specific problem components of syntactic ambiguity include sentence boundary disambiguation.
Imperfect or irregular input:
Foreign or regional accents and vocal impediments in speech; typing or grammatical errors, OCR errors in texts.
Speech acts and plans:
A sentence can often be considered an action by the speaker.
The sentence structure, alone, may not contain enough information to define this action.
For instance, a question is actually the speaker requesting some sort of response from the listener.
The desired response may be verbal, physical, or some combination.
For example, "Can you pass the class?" is a request for a simple yes-or-no answer, while "Can you pass the salt?" is requesting a physical action to be performed.
It is not appropriate to respond with "Yes, I can pass the salt," without the accompanying action (although "No" or "I can't reach the salt" would explain a lack of action).
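To illustrate why the text segmentation subproblem listed above is non-trivial, the following sketch applies greedy longest-match segmentation to an English string written without spaces. It is a minimal illustration, not a description of any particular system: the DICTIONARY contents and the function greedy_segment are hypothetical, and greedy matching is only one simple strategy among many.

```python
# Hypothetical toy dictionary; a real system would use a large lexicon.
DICTIONARY = {"the", "table", "down", "there", "theta", "bled", "own"}
MAX_WORD_LENGTH = max(len(word) for word in DICTIONARY)


def greedy_segment(text):
    """Split text by always taking the longest dictionary word available."""
    words = []
    position = 0
    while position < len(text):
        longest = min(MAX_WORD_LENGTH, len(text) - position)
        # Try the longest candidate first; fall back to a single character.
        for length in range(longest, 0, -1):
            candidate = text[position:position + length]
            if candidate in DICTIONARY or length == 1:
                words.append(candidate)
                position += length
                break
    return words


segmentation = greedy_segment("thetabledownthere")
```

On this input the greedy strategy produces "theta bled own there" rather than the intended "the table down there": every piece is a valid word, so only grammatical or semantic context can reveal that the segmentation is wrong.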
Statistical NLP
Statistical natural-language processing uses stochastic, probabilistic and statistical methods to resolve some of the difficulties discussed above, especially those which arise because longer sentences are highly ambiguous when processed with realistic grammars, yielding thousands or millions of possible analyses.
Methods for disambiguation often involve the use of corpora and Markov models.
Statistical NLP comprises all quantitative approaches to automated language processing, including probabilistic modeling, information theory, and linear algebra.
The technology for statistical NLP comes mainly from machine learning and data mining, both of which are fields of artificial intelligence that involve learning from data.
Major tasks in NLP
Automatic summarization
Foreign language reading aid
Foreign language writing aid
Information extraction
Information retrieval
Machine translation
Named entity recognition
Natural language generation
Natural language understanding
Optical character recognition
Question answering
Speech recognition
Spoken dialogue system
Text simplification
Text to speech
Text-proofing
Evaluation of natural language processing
Objectives
The goal of NLP evaluation is to measure one or more qualities of an algorithm or a system, in order to determine if (or to what extent) the system answers the goals of its designers, or the needs of its users.
Research in NLP evaluation has received considerable attention, because the definition of proper evaluation criteria is one way to specify an NLP problem precisely, thus going beyond the vagueness of tasks defined only as language understanding or language generation.
A precise set of evaluation criteria, which includes mainly evaluation data and evaluation metrics, enables several teams to compare their solutions to a given NLP problem.
Short history of evaluation in NLP
The first evaluation campaign on written texts seems to be a campaign dedicated to message understanding in 1987 (Pallet 1998).
Then, the Parseval/GEIG project compared phrase-structure grammars (Black 1991).
A series of campaigns within the Tipster project was carried out on tasks such as summarization, translation and searching (Hirshman 1998).
In 1994, in Germany, the Morpholympics compared German taggers.
Then, the Senseval and Romanseval campaigns were conducted with the objectives of semantic disambiguation.
In 1996, the Sparkle campaign compared syntactic parsers in four different languages (English, French, German and Italian).
In France, the Grace project compared a set of 21 taggers for French in 1997 (Adda 1999).
In 2004, during the Technolangue/Easy project, 13 parsers for French were compared.
Large-scale evaluations of dependency parsers were performed in the context of the CoNLL shared tasks in 2006 and 2007.
In Italy, the Evalita campaign was conducted in 2007 to compare various tools for Italian.
In France, within the ANR-Passage project (end of 2007), 10 parsers for French were compared.
Adda G., Mariani J., Paroubek P., Rajman M. 1999 L'action GRACE d'évaluation de l'assignation des parties du discours pour le français. Langues vol-2
Black E., Abney S., Flickinger D., Gdaniec C., Grishman R., Harrison P., Hindle D., Ingria R., Jelinek F., Klavans J., Liberman M., Marcus M., Reukos S., Santoni B., Strzalkowski T. 1991 A procedure for quantitatively comparing the syntactic coverage of English grammars. DARPA Speech and Natural Language Workshop
Hirshman L. 1998 Language understanding evaluation: lessons learned from MUC and ATIS. LREC Granada
Pallet D.S. 1998 The NIST role in automatic speech recognition benchmark tests. LREC Granada
Different types of evaluation
Depending on the evaluation procedures, a number of distinctions are traditionally made in NLP evaluation.
Intrinsic vs. extrinsic evaluation
Intrinsic evaluation considers an isolated NLP system and characterizes its performance mainly with respect to a gold standard result, pre-defined by the evaluators.
Extrinsic evaluation, also called evaluation in use, considers the NLP system in a more complex setting, either as an embedded system or serving a precise function for a human user.
The extrinsic performance of the system is then characterized in terms of its utility with respect to the overall task of the complex system or the human user.
Black-box vs. glass-box evaluation
Black-box evaluation requires one to run an NLP system on a given data set and to measure a number of parameters related to the quality of the process (speed, reliability, resource consumption) and, most importantly, to the quality of the result (e.g. the accuracy of data annotation or the fidelity of a translation).
Glass-box evaluation looks at the design of the system, the algorithms that are implemented, the linguistic resources it uses (e.g. vocabulary size), etc.
Given the complexity of NLP problems, it is often difficult to predict performance only on the basis of glass-box evaluation, but this type of evaluation is more informative with respect to error analysis or future developments of a system.
Automatic vs. manual evaluation
In many cases, automatic procedures can be defined to evaluate an NLP system by comparing its output with the gold standard (or desired) one.
Although the cost of producing the gold standard can be quite high, automatic evaluation can be repeated as often as needed without much additional cost (on the same input data).
However, for many NLP problems, the definition of a gold standard is a complex task, and can prove impossible when inter-annotator agreement is insufficient.
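As a sketch of what comparing output with a gold standard can look like in practice, the example below scores a set of predicted annotations against a gold set. Precision, recall and F1 are assumed here as the metrics; the text above does not name specific ones, and the function precision_recall_f1 and the sample annotations are hypothetical.

```python
def precision_recall_f1(predicted, gold):
    """Compare a set of predicted annotations with a gold-standard set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations: (token position, entity label).
gold = {(0, "PER"), (3, "LOC"), (5, "ORG"), (8, "PER")}
predicted = {(0, "PER"), (3, "ORG"), (5, "ORG")}

precision, recall, f1 = precision_recall_f1(predicted, gold)
```

Because the procedure is a pure function of the system output and the gold data, it can be rerun after every change to the system at essentially no extra cost.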
Manual evaluation is performed by human judges, who are instructed to estimate the quality of a system, or most often of a sample of its output, based on a number of criteria.
Although, thanks to their linguistic competence, human judges can be considered as the reference for a number of language processing tasks, there is also considerable variation across their ratings.
This is why automatic evaluation is sometimes referred to as objective evaluation, while the human kind appears to be more subjective.
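The variation across human ratings can itself be quantified. One commonly used statistic, assumed here because the text does not name one, is Cohen's kappa, which corrects the observed agreement between two judges for the agreement expected by chance. The function name and the sample ratings below are hypothetical.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two judges rating the same items."""
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    freq_a = Counter(ratings_a)
    freq_b = Counter(ratings_b)
    # Probability that both judges pick the same label independently.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    # Note: undefined (division by zero) when expected agreement is 1.
    return (observed - expected) / (1 - expected)


judge_a = ["good", "good", "bad", "good", "bad", "bad", "good", "bad"]
judge_b = ["good", "bad", "bad", "good", "bad", "good", "good", "bad"]
kappa = cohen_kappa(judge_a, judge_b)
```

A value of 1 means perfect agreement and a value near 0 means agreement no better than chance; low values are one sign that a gold standard may be hard to define for the task.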
Shared tasks (Campaigns)
BioCreative
Message Understanding Conference
Technolangue/Easy
Text Retrieval Conference
Standardization in NLP
An ISO sub-committee is working to ease interoperability between lexical resources and NLP programs.
The sub-committee is part of ISO/TC37 and is called ISO/TC37/SC4.
Some ISO standards are already published but most of them are under construction, mainly on lexicon representation (see LMF), annotation and data category registry.
Neural network
Traditionally, the term neural network had been used to refer to a network or circuit of biological neurons.
The modern usage of the term often refers to artificial neural networks, which are composed of artificial neurons or nodes.
Thus the term has two distinct usages:
Biological neural networks are made up of real biological neurons that are connected or functionally-related in the peripheral nervous system or the central nervous system.
In the field of neuroscience, they are often identified as groups of neurons that perform a specific physiological function in laboratory analysis.
Artificial neural networks are made up of interconnecting artificial neurons (programming constructs that mimic the properties of biological neurons).
Artificial neural networks may either be used to gain an understanding of biological neural networks, or for solving artificial intelligence problems without necessarily creating a model of a real biological system.
This article focuses on the relationship between the two concepts; for detailed coverage of the two different concepts refer to the separate articles: Biological neural network and Artificial neural network.
Characterization
In general a biological neural network is composed of a group or groups of chemically connected or functionally associated neurons.
A single neuron may be connected to many other neurons and the total number of neurons and connections in a network may be extensive.
Connections, called synapses, are usually formed from axons to dendrites, though dendrodendritic microcircuits and other connections are possible.
Apart from the electrical signaling, there are other forms of signaling that arise from neurotransmitter diffusion, which have an effect on electrical signaling.
As such, neural networks are extremely complex.
Artificial intelligence and cognitive modeling try to simulate some properties of neural networks.
While similar in their techniques, the former has the aim of solving particular tasks, while the latter aims to build mathematical models of biological neural systems.
In the artificial intelligence field, artificial neural networks have been applied successfully to speech recognition, image analysis and adaptive control, in order to construct software agents (in computer and video games) or autonomous robots.
Most of the currently employed artificial neural networks for artificial intelligence are based on statistical estimation, optimization and control theory.
The cognitive modelling field involves the physical or mathematical modeling of the behaviour of neural systems, ranging from the individual neural level (e.g. modelling the spike response curves of neurons to a stimulus), through the neural cluster level (e.g. modelling the release and effects of dopamine in the basal ganglia) to the complete organism (e.g. behavioural modelling of the organism's response to stimuli).
The brain, neural networks and computers
Neural networks, as used in artificial intelligence, have traditionally been viewed as simplified models of neural processing in the brain, even though the relation between this model and brain biological architecture is debated.
A subject of current research in theoretical neuroscience is the question surrounding the degree of complexity and the properties that individual neural elements should have to reproduce something resembling animal intelligence.
Historically, computers evolved from the von Neumann architecture, which is based on sequential processing and execution of explicit instructions.
On the other hand, the origins of neural networks are based on efforts to model information processing in biological systems, which may rely largely on parallel processing as well as implicit instructions based on recognition of patterns of 'sensory' input from external sources.
In other words, at its very heart a neural network is a complex statistical processor (as opposed to being tasked to sequentially process and execute).
Neural networks and artificial intelligence
An artificial neural network (ANN), also called a simulated neural network (SNN) or commonly just neural network (NN), is an interconnected group of artificial neurons that uses a mathematical or computational model for information processing based on a connectionistic approach to computation.
In most cases an ANN is an adaptive system that changes its structure based on external or internal information that flows through the network.
In more practical terms neural networks are non-linear statistical data modeling or decision making tools.
They can be used to model complex relationships between inputs and outputs or to find patterns in data.
Background
An artificial neural network involves a network of simple processing elements (artificial neurons) which can exhibit complex global behaviour, determined by the connections between the processing elements and element parameters.
One classical type of artificial neural network is the Hopfield net.
In a neural network model simple nodes, which can be called variously "neurons", "neurodes", "Processing Elements" (PE) or "units", are connected together to form a network of nodes — hence the term "neural network".
While a neural network does not have to be adaptive per se, its practical use comes with algorithms designed to alter the strength (weights) of the connections in the network to produce a desired signal flow.
In modern software implementations of artificial neural networks the approach inspired by biology has more or less been abandoned for a more practical approach based on statistics and signal processing.
In some of these systems neural networks, or parts of neural networks (such as artificial neurons) are used as components in larger systems that combine both adaptive and non-adaptive elements.
The concept of a neural network appears to have first been proposed by Alan Turing in his 1948 paper "Intelligent Machinery".
Applications
The utility of artificial neural network models lies in the fact that they can be used to infer a function from observations and also to use it.
This is particularly useful in applications where the complexity of the data or task makes the design of such a function by hand impractical.
Real life applications
The tasks to which artificial neural networks are applied tend to fall within the following broad categories:
Function approximation, or regression analysis, including time series prediction and modelling.
Classification, including pattern and sequence recognition, novelty detection and sequential decision making.
Data processing, including filtering, clustering, blind signal separation and compression.
Application areas include system identification and control (vehicle control, process control), game-playing and decision making (backgammon, chess, racing), pattern recognition (radar systems, face identification, object recognition, etc.), sequence recognition (gesture, speech, handwritten text recognition), medical diagnosis, financial applications, data mining (or knowledge discovery in databases, "KDD"), visualization and e-mail spam filtering.
Neural network software
Main article: Neural network software
Neural network software is used to simulate, research, develop and apply artificial neural networks, biological neural networks and in some cases a wider array of adaptive systems.
Learning paradigms
There are three major learning paradigms, each corresponding to a particular abstract learning task.
These are supervised learning, unsupervised learning and reinforcement learning.
Usually any given type of network architecture can be employed in any of those tasks.
Supervised learning
In supervised learning, we are given a set of example pairs $(x, y)$, with $x \in X$ and $y \in Y$, and the aim is to find a function $f : X \to Y$ in the allowed class of functions that matches the examples.
In other words, we wish to infer the mapping implied by the data; the cost function is related to the mismatch between our mapping and the data.
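A minimal sketch of this setting, under assumptions not made in the text: the allowed class of functions is taken to be straight lines, the cost is the mean squared mismatch between the line and the example pairs, and the cost is reduced by plain gradient descent. The function fit_line, its parameters and the example data are hypothetical.

```python
def fit_line(pairs, learning_rate=0.05, steps=2000):
    """Fit f(x) = w * x + b to example pairs by minimising mean squared error."""
    w, b = 0.0, 0.0
    n = len(pairs)
    for _ in range(steps):
        # Gradients of the mean squared mismatch with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in pairs) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in pairs) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Hypothetical example pairs generated by the mapping y = 2x + 1.
examples = [(float(x), 2.0 * x + 1.0) for x in range(5)]
w, b = fit_line(examples)
```

The learned parameters approach the slope and intercept that generated the examples, which is the sense in which the mapping is inferred from the data.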
Unsupervised learning
In unsupervised learning we are given some data $x$ and a cost function to be minimized, which can be any function of $x$ and the network's output, $f$.
The cost function is determined by the task formulation.
Most applications fall within the domain of estimation problems such as statistical modeling, compression, filtering, blind source separation and clustering.
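Clustering gives a compact example of minimising a cost that depends only on the data and the model's output. The sketch below uses one-dimensional k-means, chosen here purely for illustration (it is not a neural network, and the text does not prescribe it); the cost is the sum of squared distances from each point to its nearest centre. All names and data are hypothetical.

```python
def k_means_1d(points, centers, iterations=10):
    """Alternate between assigning points to centres and recomputing centres."""
    centers = list(centers)
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: (p - centers[i]) ** 2)
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centre.
        centers = [sum(c) / len(c) if c else old for c, old in zip(clusters, centers)]
    return centers


def cost(points, centers):
    """Sum of squared distances from each point to its nearest centre."""
    return sum(min((p - c) ** 2 for c in centers) for p in points)


data = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
initial = [1.0, 1.2]
final = k_means_1d(data, initial)
```

No target values are supplied at any point; the structure (two groups, around 1 and around 5) is found solely by driving the cost down.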
Reinforcement learning
In reinforcement learning, data is usually not given, but generated by an agent's interactions with the environment.
At each point in time $t$, the agent performs an action $y_t$ and the environment generates an observation $x_t$ and an instantaneous cost $c_t$, according to some (usually unknown) dynamics.
The aim is to discover a policy for selecting actions that minimises some measure of a long-term cost, i.e. the expected cumulative cost.
The environment's dynamics and the long-term cost for each policy are usually unknown, but can be estimated.
ANNs are frequently used in reinforcement learning as part of the overall algorithm.
Tasks that fall within the paradigm of reinforcement learning are control problems, games and other sequential decision making tasks.
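The notion of a policy that minimises long-term cost can be sketched on a toy problem. The example below makes a simplifying assumption that the text says usually does not hold: the environment's dynamics are known, so the cumulative cost can be computed directly by value iteration rather than estimated from interaction. The four-state chain, the step function and all names are hypothetical.

```python
GOAL = 3  # states are 0, 1, 2, 3; reaching state 3 ends the episode
ACTIONS = {"left": -1, "right": +1}


def step(state, action):
    """Hypothetical deterministic dynamics: returns (next_state, cost)."""
    next_state = min(max(state + ACTIONS[action], 0), GOAL)
    return next_state, 1.0  # every move costs 1


def value_iteration(sweeps=50):
    """Compute the minimal cumulative cost from each state and a greedy policy."""
    cost_to_go = [0.0] * (GOAL + 1)
    for _ in range(sweeps):
        for s in range(GOAL):  # the goal state is terminal and costs nothing
            outcomes = [step(s, a) for a in ACTIONS]
            cost_to_go[s] = min(c + cost_to_go[n] for n, c in outcomes)
    policy = {}
    for s in range(GOAL):
        policy[s] = min(
            ACTIONS, key=lambda a: step(s, a)[1] + cost_to_go[step(s, a)[0]]
        )
    return cost_to_go, policy


cost_to_go, policy = value_iteration()
```

In a genuine reinforcement learning setting the step function would not be available to the learner, and the same quantities would have to be estimated from sampled interactions.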
Learning algorithms
There are many algorithms for training neural networks; most of them can be viewed as a straightforward application of optimization theory and statistical estimation.
Evolutionary computation methods, simulated annealing, expectation maximization and non-parametric methods are among other commonly used methods for training neural networks.
See also machine learning.
Recent developments in this field have also seen particle swarm optimization and other swarm intelligence techniques used in the training of neural networks.
Neural networks and neuroscience
Theoretical and computational neuroscience is the field concerned with the theoretical analysis and computational modeling of biological neural systems.
Since neural systems are intimately related to cognitive processes and behaviour, the field is closely related to cognitive and behavioural modeling.
The aim of the field is to create models of biological neural systems in order to understand how biological systems work.
To gain this understanding, neuroscientists strive to make a link between observed biological processes (data), biologically plausible mechanisms for neural processing and learning (biological neural network models) and theory (statistical learning theory and information theory).
Types of models
Many models are used in the field, each defined at a different level of abstraction and trying to model different aspects of neural systems.
They range from models of the short-term behaviour of individual neurons, through models of how the dynamics of neural circuitry arise from interactions between individual neurons, to models of how behaviour can arise from abstract neural modules that represent complete subsystems.
These include models of the long-term and short-term plasticity of neural systems and its relation to learning and memory, from the individual neuron to the system level.
Current research
While initially research had been concerned mostly with the electrical characteristics of neurons, a particularly important part of the investigation in recent years has been the exploration of the role of neuromodulators such as dopamine, acetylcholine, and serotonin on behaviour and learning.
Biophysical models, such as BCM theory, have been important in understanding mechanisms for synaptic plasticity, and have had applications in both computer science and neuroscience.
Research is ongoing in understanding the computational algorithms used in the brain, with some recent biological evidence for radial basis networks and neural backpropagation as mechanisms for processing data.
History of the neural network analogy
The concept of neural networks started in the late 1800s as an effort to describe how the human mind performed.
These ideas started being applied to computational models with the Perceptron.
In the early 1950s Friedrich Hayek was one of the first to posit the idea of spontaneous order in the brain arising out of decentralized networks of simple units (neurons).
In the late 1940s, Donald Hebb made one of the first hypotheses for a mechanism of neural plasticity (i.e. learning), Hebbian learning.
Hebbian learning is considered to be a 'typical' unsupervised learning rule and it (and variants of it) was an early model for long term potentiation.
The Perceptron is essentially a linear classifier for classifying data $x \in \mathbb{R}^n$, specified by parameters $w \in \mathbb{R}^n$, $b \in \mathbb{R}$ and an output function $f = w^\top x + b$.
Its parameters are adapted with an ad-hoc rule similar to stochastic steepest gradient descent.
Because the inner product is a linear operator in the input space, the Perceptron can only perfectly classify a set of data for which different classes are linearly separable in the input space, while it often fails completely for non-separable data.
While the development of the algorithm initially generated some enthusiasm, partly because of its apparent relation to biological mechanisms, the later discovery of this inadequacy caused such models to be abandoned until the introduction of non-linear models into the field.
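The limitation to linearly separable data can be observed directly. The sketch below trains a perceptron with the classic error-driven update on two Boolean functions: AND, which is linearly separable, and XOR, which is not. The function names, the learning rate and the number of epochs are illustrative assumptions, not details taken from the original work.

```python
def predict(weights, bias, inputs):
    """Threshold the linear output w.x + b at zero."""
    activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if activation > 0 else 0


def train_perceptron(samples, epochs=100, learning_rate=1.0):
    """Adjust weights only when a sample is misclassified."""
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias


def accuracy(samples, weights, bias):
    correct = sum(predict(weights, bias, x) == t for x, t in samples)
    return correct / len(samples)


AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

and_accuracy = accuracy(AND, *train_perceptron(AND))
xor_accuracy = accuracy(XOR, *train_perceptron(XOR))
```

Training on AND reaches perfect accuracy after a handful of epochs, whereas no choice of weights and bias can classify all four XOR points correctly, however long training runs.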
The Cognitron (1975) was an early multilayered neural network with a training algorithm.
The actual structure of the network and the methods used to set the interconnection weights change from one neural strategy to another, each with its advantages and disadvantages.
Networks can propagate information in one direction only, or they can bounce back and forth until self-activation at a node occurs and the network settles on a final state.
The ability for bi-directional flow of inputs between neurons/nodes was produced with Hopfield's network (1982), and specialization of these node layers for specific purposes was introduced through the first hybrid network.
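A small sketch of a Hopfield-style network shows what it means for a network to settle on a final state. The storage rule used here (summing outer products of the stored patterns, with no self-connections) and the asynchronous update order are standard choices assumed for illustration; the text above does not spell them out, and the function names are hypothetical.

```python
def store(patterns):
    """Build symmetric weights from +1/-1 patterns (outer-product rule)."""
    n = len(patterns[0])
    weights = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:  # no self-connections
                    weights[i][j] += p[i] * p[j]
    return weights


def recall(weights, state, max_sweeps=10):
    """Update units one at a time until no unit changes its value."""
    state = list(state)
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(state)):
            field = sum(w * s for w, s in zip(weights[i], state))
            new_value = 1 if field >= 0 else -1
            if new_value != state[i]:
                state[i] = new_value
                changed = True
        if not changed:
            break
    return state


pattern = [1, -1, 1, -1, 1, -1]
noisy = [-1, -1, 1, -1, 1, -1]  # first unit flipped
restored = recall(store([pattern]), noisy)
```

Starting from the corrupted input, the units influence one another through the symmetric connections until the state stops changing, at which point it coincides with the stored pattern.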
The parallel distributed processing of the mid-1980s became popular under the name connectionism.
The rediscovery of the backpropagation algorithm was probably the main reason behind the repopularisation of neural networks after the publication of "Learning Internal Representations by Error Propagation" in 1986 (though backpropagation itself dates from 1974).
The original network utilised multiple layers of weight-sum units of the type $f = g(w^\top x + b)$, where $g$ was a sigmoid function or logistic function such as used in logistic regression.
Training was done by a form of stochastic steepest gradient descent.
The employment of the chain rule of differentiation in deriving the appropriate parameter updates results in an algorithm that seems to 'backpropagate errors', hence the nomenclature.
However it is essentially a form of gradient descent.
Determining the optimal parameters in a model of this type is not trivial, and steepest gradient descent methods cannot be relied upon to give the solution without a good starting point.
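That backpropagation is simply the chain rule applied to a layered network can be checked numerically. The sketch below builds a tiny network with one hidden layer of sigmoid weight-sum units, computes the gradient of a squared-error loss by propagating errors backwards, and compares one component with a finite-difference estimate. The network size, parameter values and function names are hypothetical choices made for the example.

```python
import copy
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(params, x):
    """Return hidden activations and the network output."""
    W1, b1, w2, b2 = params
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    out = sigmoid(sum(w * h for w, h in zip(w2, hidden)) + b2)
    return hidden, out


def loss(params, x, y):
    return 0.5 * (forward(params, x)[1] - y) ** 2


def backprop(params, x, y):
    """Gradients of the loss via the chain rule, output layer first."""
    W1, b1, w2, b2 = params
    hidden, out = forward(params, x)
    delta_out = (out - y) * out * (1.0 - out)  # dLoss/d(output pre-activation)
    grad_w2 = [delta_out * h for h in hidden]
    delta_hidden = [delta_out * w * h * (1.0 - h) for w, h in zip(w2, hidden)]
    grad_W1 = [[d * xi for xi in x] for d in delta_hidden]
    return grad_W1, delta_hidden, grad_w2, delta_out


# Hypothetical parameters: (hidden weights, hidden biases, output weights, output bias).
params = ([[0.1, -0.2], [0.4, 0.3]], [0.0, 0.1], [0.5, -0.6], 0.2)
x, y = [1.0, 0.5], 1.0
analytic = backprop(params, x, y)[0][0][1]

# Central finite difference on the same weight, W1[0][1].
eps = 1e-5
plus, minus = copy.deepcopy(params), copy.deepcopy(params)
plus[0][0][1] += eps
minus[0][0][1] -= eps
numeric = (loss(plus, x, y) - loss(minus, x, y)) / (2 * eps)
```

The two values agree to many decimal places, confirming that the "backpropagated error" is the ordinary gradient that a steepest-descent update would use.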
In recent times, networks with the same architecture as the backpropagation network are referred to as Multi-Layer Perceptrons.
This name does not impose any limitations on the type of algorithm used for learning.
The backpropagation network generated much enthusiasm at the time and there was much controversy about whether such learning could be implemented in the brain or not, partly because a mechanism for reverse signalling was not obvious at the time, but most importantly because there was no plausible source for the 'teaching' or 'target' signal.
Criticism
A. K. Dewdney, a former Scientific American columnist, wrote in 1997, “Although neural nets do solve a few toy problems, their powers of computation are so limited that I am surprised anyone takes them seriously as a general problem-solving tool.”
(Dewdney, p.82)
Arguments against Dewdney's position are that neural nets have been successfully used to solve many complex and diverse tasks, ranging from autonomously flying aircraft (www.nasa.gov/centers/dryden/news/NewsReleases/2003/03-49.html) to detecting credit card fraud (www.visa.ca/en/about/visabenefits/innovation.cfm).
Technology writer Roger Bridgman commented on Dewdney's statements about neural nets:
Neural networks, for instance, are in the dock not only because they have been hyped to high heaven, (what hasn't?) but also because you could create a successful net without understanding how it worked: the bunch of numbers that captures its behaviour would in all probability be "an opaque, unreadable table...valueless as a scientific resource".
In spite of his emphatic declaration that science is not technology, Dewdney seems here to pillory neural nets as bad science when most of those devising them are just trying to be good engineers.
An unreadable table that a useful machine could read would still be well worth having.
N-gram
An n-gram is a sub-sequence of n items from a given sequence.
n-grams are used in various areas of statistical natural language processing and genetic sequence analysis.
The items in question can be letters, words or base pairs according to the application.
An n-gram of size 1 is a "unigram"; size 2 is a "bigram" (or, more etymologically sound but less commonly used, a "digram"); size 3 is a "trigram"; and size 4 or more is simply called an "n-gram".
Some language models built from n-grams are "(n − 1)-order Markov models".
Examples
Here are examples of word level 3-grams and 4-grams (and counts of the number of times they appeared) from the Google n-gram corpus.
3-grams
ceramics collectables collectibles (55)
ceramics collectables fine (130)
ceramics collected by (52)
ceramics collectible pottery (50)
ceramics collectibles cooking (45)
4-grams
serve as the incoming (92)
serve as the incubator (99)
serve as the independent (794)
serve as the index (223)
serve as the indication (72)
serve as the indicator (120)
n-gram models
An n-gram model models sequences, notably natural languages, using the statistical properties of n-grams.
This idea can be traced to an experiment in Claude Shannon's work on information theory.
His question was, given a sequence of letters (for example, the sequence "for ex"), what is the likelihood of the next letter?
From training data, one can derive a probability distribution for the next letter given a history of size $n - 1$: a = 0.4, b = 0.00001, c = 0, ...; where the probabilities of all possible "next-letters" sum to 1.0.
More concisely, an n-gram model predicts $x_i$ based on $x_{i-(n-1)}, \dots, x_{i-1}$.
In probability terms, this is nothing but $P(x_i \mid x_{i-(n-1)}, \dots, x_{i-1})$.
When used for language modeling, independence assumptions are made so that each word depends only on the last n − 1 words.
This Markov model is used as an approximation of the true underlying language.
This assumption is important because it massively simplifies the problem of learning the language model from data.
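The counting idea behind such a model can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `train_next_letter_model` and `next_letter_distribution` and the toy training text are invented for the example, and the estimate is a plain relative frequency with no smoothing.

```python
from collections import Counter, defaultdict

def train_next_letter_model(text, n):
    """Count, for every history of n letters, how often each letter follows it."""
    counts = defaultdict(Counter)
    for i in range(n, len(text)):
        history = text[i - n:i]
        counts[history][text[i]] += 1
    return counts

def next_letter_distribution(counts, history):
    """Turn the raw counts for one history into probabilities that sum to 1."""
    following = counts[history]
    total = sum(following.values())
    return {letter: c / total for letter, c in following.items()}

# Toy training data (an assumption for illustration only).
training_text = "for example, for extra credit, for exams"
model = train_next_letter_model(training_text, n=6)
dist = next_letter_distribution(model, "for ex")
print(dist)
```

In this toy text the history "for ex" is followed twice by "a" and once by "t", so those are the only letters that receive non-zero probability; every other letter gets probability 0 unless some form of smoothing is applied.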
In addition, because of the open nature of language, it is common to group words unknown to the language model together.
n-gram models are widely used in statistical natural language processing.
In speech recognition, phonemes and sequences of phonemes are modeled using an n-gram distribution.
For parsing, words are modeled such that each n-gram is composed of n words.
For language recognition, sequences of letters are modeled for different languages.
For a sequence of words, (for example "the dog smelled like a skunk"), the trigrams would be: "the dog smelled", "dog smelled like", "smelled like a", and "like a skunk".
For sequences of characters, the 3-grams (sometimes referred to as "trigrams") that can be generated from "good morning" are "goo", "ood", "od ", "d m", " mo", "mor" and so forth.
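The two examples above can be reproduced with a short Python sketch. The helper name `ngrams` is an assumption made for illustration and does not refer to any particular library.

```python
def ngrams(items, n):
    """Return all contiguous n-item sub-sequences of `items`, in order."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

# Word-level trigrams: split the sentence into words first.
words = "the dog smelled like a skunk".split()
word_trigrams = [" ".join(g) for g in ngrams(words, 3)]

# Character-level 3-grams: a string is already a sequence of characters,
# so spaces are treated as ordinary items.
char_trigrams = ["".join(g) for g in ngrams("good morning", 3)]

print(word_trigrams)
print(char_trigrams)
```

A sequence of length k yields k − n + 1 n-grams, so the six-word sentence gives four trigrams and the twelve-character string gives ten.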
Some practitioners preprocess strings to remove spaces; most simply collapse whitespace to a single space while preserving paragraph marks.
Punctuation is also commonly reduced or removed by preprocessing.
n-grams can also be used for sequences of words or, in fact, for almost any type of data.
They have been used for example for extracting features for clustering large sets of satellite earth images and for determining what part of the Earth a particular image came from.
They have also been very successful as the first pass in genetic sequence search and in the identification of which species short sequences of DNA were taken from.
N-gram models are often criticized because they lack any explicit representation of long range dependency.
While it is true that the only explicit dependency range is (n-1) tokens for an n-gram model, the effective range of dependency is significantly longer than this, although long-range correlations drop exponentially with distance for any Markov model.
Alternative Markov language models that incorporate some degree of local state can exhibit very long range dependencies.
This is often done using hand-crafted state variables that represent, for instance, the position in a sentence, the general topic of discourse or a grammatical state variable.
Some of the best parsers of English currently in existence are roughly of this form.
Another criticism that has been leveled is that Markov models of language, including n-gram models, do not explicitly capture the performance/competence distinction introduced by Noam Chomsky.
This criticism fails to explain why parsers that are the best at parsing text seem to uniformly lack any such distinction and most even lack any clear distinction between semantics and syntax.
Most proponents of n-gram and related language models opt for a fairly pragmatic approach to language modeling that emphasizes empirical results over theoretical purity.
n-grams for approximate matching
n-grams can also be used for efficient approximate matching.
By converting a sequence of items to a set of n-grams, it can be embedded in a vector space (in other words, represented as a histogram), thus allowing the sequence to be compared to other sequences in an efficient manner.
For example, if we convert strings with only letters in the English alphabet into 3-grams, we get a $26^3$-dimensional space (the first dimension measures the number of occurrences of "aaa", the second "aab", and so forth for all possible combinations of three letters).
Using this representation, we lose information about the string.
For example, both the strings "abcba" and "bcbab" give rise to exactly the same 2-grams.
However, we know empirically that if two strings of real text have a similar vector representation (as measured by cosine distance) then they are likely to be similar.
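A minimal Python sketch of this representation follows. It stores each histogram sparsely (only n-grams that actually occur) and computes cosine similarity, which under the usual convention is one minus the cosine distance. The function names and the second pair of test strings are assumptions chosen for illustration.

```python
from collections import Counter
from math import sqrt

def ngram_counts(text, n):
    """Histogram of character n-grams: a sparse vector keyed by n-gram."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[g] * b[g] for g in a if g in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# The two strings from the text have identical 2-gram histograms.
v1 = ngram_counts("abcba", 2)
v2 = ngram_counts("bcbab", 2)
same_histogram = (v1 == v2)
sim_identical = cosine_similarity(v1, v2)

# Two different but related strings share only some of their 3-grams.
sim_related = cosine_similarity(ngram_counts("good morning", 3),
                                ngram_counts("good evening", 3))
print(same_histogram, sim_identical, sim_related)
```

The first comparison shows the information loss described above: the histograms are equal even though the strings differ. The second pair shares five of its ten 3-grams, giving a similarity of about 0.5.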
Other metrics have also been applied to vectors of n-grams with varying, sometimes better, results.
For example z-scores have been used to compare documents by examining how many standard deviations each n-gram differs from its mean occurrence in a large collection, or text corpus, of documents (which form the "background" vector).
In the event of small counts, the g-score may give better results for comparing alternative models.
It is also possible to take a more principled approach to the statistics of n-grams, modeling similarity as the likelihood that two strings came from the same source directly in terms of a problem in Bayesian inference.
Other applications
n-grams find use in several areas of computer science, computational linguistics, and applied mathematics.
They have been used to:
design kernels that allow machine learning algorithms such as support vector machines to learn from string data
find likely candidates for the correct spelling of a misspelled word
improve compression in compression algorithms where a small area of data requires n-grams of greater length
assess the probability of a given word sequence appearing in text of a language of interest in pattern recognition systems, speech recognition, OCR (optical character recognition), Intelligent Character Recognition (ICR), machine translation and similar applications
improve retrieval in information retrieval systems when it is hoped to find similar "documents" (a term for which the conventional meaning is sometimes stretched, depending on the data set) given a single query document and a database of reference documents
improve retrieval performance in genetic sequence analysis as in the BLAST family of programs
identify the language a text is in or the species a small sequence of DNA was taken from
predict letters or words at random in order to create text, as in the dissociated press algorithm.
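The last application in the list can be sketched as a random walk over observed n-grams. The code below is a simplified illustration in the spirit of that idea, not a faithful reproduction of any particular implementation; the function names, the toy source text and the fixed random seed are all assumptions.

```python
import random
from collections import defaultdict

def build_successors(words, n):
    """Map each (n-1)-word history to the words seen after it in the text."""
    successors = defaultdict(list)
    for i in range(len(words) - n + 1):
        history = tuple(words[i:i + n - 1])
        successors[history].append(words[i + n - 1])
    return successors

def generate(words, n, length, rng):
    """Random walk: repeatedly sample a word that followed the current history."""
    if n < 2:
        raise ValueError("n must be at least 2")
    successors = build_successors(words, n)
    output = list(words[:n - 1])          # start from the opening history
    while len(output) < length:
        history = tuple(output[-(n - 1):])
        choices = successors.get(history)
        if not choices:                   # dead end: history has no known successor
            break
        output.append(rng.choice(choices))
    return output

# Toy source text (illustrative only).
source = "the dog saw the cat and the cat saw the dog and the dog ran".split()
text = generate(source, n=2, length=12, rng=random.Random(0))
print(" ".join(text))
```

Because successors are stored with repetition, a word that follows a given history more often in the source is proportionally more likely to be chosen, so every adjacent pair in the output is a bigram that occurs in the source.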
Bias-versus-variance trade-off
What goes into picking the n for the n-gram?
There is the problem of balancing the weight given to infrequent n-grams (for example, if a proper name appeared in the training data) against that given to frequent n-grams.
Also, items not seen in the training data will be given a probability of 0.0 without smoothing.
For unseen but plausible data from a sample, one can introduce pseudocounts.
Pseudocounts are generally motivated on Bayesian grounds.
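As a minimal sketch of the idea, the following Python adds a fixed pseudocount to every item in a known vocabulary before normalizing (additive smoothing). The function name, the four-letter vocabulary and the pseudocount of 1 are assumptions for illustration.

```python
from collections import Counter

def smoothed_probability(counts, item, vocabulary, pseudocount=1.0):
    """Estimate P(item) after adding `pseudocount` to the count of every item."""
    total = sum(counts.values()) + pseudocount * len(vocabulary)
    return (counts[item] + pseudocount) / total

vocabulary = ["a", "b", "c", "d"]
observed = Counter("aaabbc")     # "d" never appears in the training data

# Without smoothing, the unseen item gets probability 0.
unsmoothed_d = observed["d"] / sum(observed.values())

# With a pseudocount of 1, it gets (0 + 1) / (6 + 4).
smoothed_d = smoothed_probability(observed, "d", vocabulary)

print(unsmoothed_d, smoothed_d)
```

The smoothed estimates still sum to 1 over the vocabulary, because the denominator grows by exactly the total amount of pseudocount added.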
Smoothing techniques
Linear interpolation (e.g., taking the weighted mean of the unigram, bigram, and trigram)
Good-Turing discounting
Witten-Bell discounting
Katz's back-off model (trigram)
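The first technique in the list, linear interpolation, can be sketched as follows. The weights, the function name and the toy corpus are assumptions for illustration; in practice the weights would be tuned on held-out data, and the relative frequencies below ignore edge effects at the ends of the corpus.

```python
from collections import Counter

def interpolated_trigram_probability(words, w1, w2, w3, weights=(0.1, 0.3, 0.6)):
    """Weighted mean of unigram, bigram and trigram relative frequencies for w3."""
    unigrams = Counter(words)
    bigrams = Counter(zip(words, words[1:]))
    trigrams = Counter(zip(words, words[1:], words[2:]))

    p_uni = unigrams[w3] / len(words)
    p_bi = bigrams[(w2, w3)] / unigrams[w2] if unigrams[w2] else 0.0
    p_tri = trigrams[(w1, w2, w3)] / bigrams[(w1, w2)] if bigrams[(w1, w2)] else 0.0

    l1, l2, l3 = weights           # assumed weights; they must sum to 1
    return l1 * p_uni + l2 * p_bi + l3 * p_tri

# Toy corpus (illustrative only).
corpus = "the dog saw the cat and the cat saw the dog".split()

# "saw the cat" occurs in the corpus; "saw the and" does not.
p_seen = interpolated_trigram_probability(corpus, "saw", "the", "cat")
p_unseen = interpolated_trigram_probability(corpus, "saw", "the", "and")
print(p_seen, p_unseen)
```

The unseen trigram still receives a small non-zero probability, because the unigram term contributes even when the bigram and trigram counts are zero.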
Google's use of n-grams
Google uses n-gram models for a variety of R&D projects, such as statistical machine translation, speech recognition, checking spelling, entity detection, and data mining.
In September 2006, Google announced that it had made its n-grams public at the Linguistic Data Consortium (LDC).
Noun
In linguistics, a noun is a member of a large, open lexical category whose members can occur as the main word in the subject of a clause, the object of a verb, or the object of a preposition.
Lexical categories are defined in terms of how their members combine with other kinds of expressions.
The syntactic rules for nouns differ from language to language.
In English, nouns may be defined as those words which can occur with articles and attributive adjectives and can function as the head of a noun phrase.
In traditional English grammar, the noun is one of the eight parts of speech.
History
The word comes from the Latin nomen meaning "name".
Word classes like nouns were first described by the Sanskrit grammarian Pāṇini and ancient Greeks like Dionysios Thrax, and were defined in terms of their morphological properties.
For example, in Ancient Greek, nouns inflect for grammatical case, such as dative or accusative.
Verbs, on the other hand, inflect for tenses, such as past, present or future, while nouns do not.
Aristotle also had a notion of onomata (nouns) and rhemata (verbs) which, however, does not exactly correspond with modern notions of nouns and verbs.
Vinokurova 2005 has a more detailed discussion of the historical origin of the notion of a noun.
Different definitions of nouns
Expressions of natural language have properties at different levels.
They have formal properties, like what kinds of morphological prefixes or suffixes they take and what kinds of other expressions they combine with; but they also have semantic properties, i.e. properties pertaining to their meaning.
The definition of a noun at the outset of this page is thus a formal, traditional grammatical definition.
That definition is, for the most part, considered uncontroversial, and it gives language users an effective means of distinguishing most nouns from non-nouns.
However, it has the disadvantage that it does not apply to nouns in all languages.
For example in Russian, there are no definite articles, so one cannot define nouns as words that are modified by definite articles.
There have also been several attempts to define nouns in terms of their semantic properties.
Many of these are controversial, but some are discussed below.
Names for things
In traditional school grammars, one often encounters the definition of nouns that they are all and only those expressions that refer to a person, place, thing, event, substance, quality, or idea, etc.
This is a semantic definition.
It has been criticized by contemporary linguists as being uninformative.
Contemporary linguists generally agree that one cannot successfully define nouns (or other grammatical categories) in terms of what sort of object in the world they refer to or signify.
Part of the conundrum is that the definition makes use of relatively general nouns ("thing", "phenomenon", "event") to define what nouns are.
The existence of such general nouns demonstrates that nouns refer to entities that are organized in taxonomic hierarchies.
But other kinds of expressions are also organized into such structured taxonomic relationships.
For example, the verbs "stroll", "saunter", "stride", and "tread" are more specific words than the more general "walk".
Moreover, "walk" is more specific than the verb "move", which, in turn, is less general than "change".
But it is unlikely that such taxonomic relationships can be used to define nouns and verbs.
We cannot define verbs as those words that refer to "changes" or "states", for example, because the nouns "change" and "state" probably refer to such things, but, of course, aren't verbs.
Similarly, nouns like "invasion", "meeting", or "collapse" refer to things that are "done" or "happen".
In fact, an influential theory has it that verbs like "kill" or "die" refer to events, which is among the sort of thing that nouns are supposed to refer to.
The point being made here is not that this view of verbs is wrong, but rather that this property of verbs is a poor basis for a definition of this category, just like the property of having wheels is a poor basis for a definition of cars (some things that have wheels, such as my suitcase or a jumbo jet, aren't cars).
Similarly, adjectives like "yellow" or "difficult" might be thought to refer to qualities, and adverbs like "outside" or "upstairs" seem to refer to places, which are also among the sorts of things nouns can refer to.
But verbs, adjectives and adverbs are not nouns, and nouns aren't verbs, adjectives or adverbs.
One might argue that "definitions" of this sort really rely on speakers' prior intuitive knowledge of what nouns, verbs and adjectives are, and so don't really add anything over and beyond this.
Speakers' intuitive knowledge of such things might plausibly be based on formal criteria, such as the traditional grammatical definition of English nouns mentioned above.
Prototypically referential expressions
Another semantic definition of nouns is that they are prototypically referential.
That definition is also not very helpful in distinguishing actual nouns from verbs.
But it may still correctly identify a core property of nounhood.
For example, we will tend to use nouns like "fool" and "car" when we wish to refer to fools and cars, respectively.
The notion that this is prototypical reflects the fact that such nouns can be used, even though nothing with the corresponding property is referred to:
John is no fool.
If I had a car, I'd go to Marrakech.
The first sentence above doesn't refer to any fools, nor does the second one refer to any particular car.
Predicates with identity criteria
The British logician Peter Thomas Geach proposed a very subtle semantic definition of nouns.
He noticed that adjectives like "same" can modify nouns, but no other kinds of parts of speech, like verbs or adjectives.
Not only that, but there also don't seem to be any other expressions with similar meaning that can modify verbs and adjectives.
Consider the following examples.
Good: John and Bill participated in the same fight.
Bad: *John and Bill samely fought.
There is no English adverb "samely".
In some other languages, like Czech, however, there are adverbs corresponding to "samely".
Hence, in Czech, the translation of the last sentence would be fine; however, it would mean that John and Bill fought in the same way: not that they participated in the same fight.
Geach proposed that we could explain this, if nouns denote logical predicates with identity criteria.
An identity criterion would allow us to conclude, for example, that "person x at time 1 is the same person as person y at time 2".
Different nouns can have different identity criteria.
A well known example of this is due to Gupta:
National Airlines transported 2 million passengers in 1979.
National Airlines transported (at least) 2 million persons in 1979.
Given that, in general, all passengers are persons, the last sentence above ought to follow logically from the first one.
But it doesn't.
It is easy to imagine, for example, that on average, every person who travelled with National Airlines in 1979, travelled with them twice.
In that case, one would say that the airline transported 2 million passengers but only 1 million persons.
Thus, the way that we count passengers isn't necessarily the same as the way that we count persons.
Put somewhat differently: At two different times, you may correspond to two distinct passengers, even though you are one and the same person.
For a precise definition of identity criteria, see Gupta.
Recently, Baker has proposed that Geach's definition of nouns in terms of identity criteria allows us to explain the characteristic properties of nouns.
He argues that nouns can co-occur with (in-)definite articles and numerals, and are "prototypically referential" because they are all and only those parts of speech that provide identity criteria.
Baker's proposals are quite new, and linguists are still evaluating them.
Classification of nouns in English
Proper nouns and common nouns
Proper nouns (also called proper names) are nouns representing unique entities (such as London, Universe or John), as distinguished from common nouns which describe a class of entities (such as city, planet or person).
In English and most other languages that use the Latin alphabet, proper nouns are usually capitalized.
Languages differ in whether most elements of multiword proper nouns are capitalised (e.g., American English House of Representatives) or only the initial element (e.g., Slovenian Državni zbor 'National Assembly').
In German, nouns of all types are capitalized.
The convention of capitalizing all nouns was previously used in English, but ended circa 1800.
In America, the shift in capitalization is recorded in several noteworthy documents.
The end (but not the beginning) of the Declaration of Independence (1776) and all of the Constitution (1787) show nearly all nouns capitalized, the Bill of Rights (1789) capitalizes a few common nouns but not most of them, and the Thirteenth Constitutional Amendment (1865) only capitalizes proper nouns.
Sometimes the same word can function as both a common noun and a proper noun, where one such entity is special.
For example the common noun god denotes all deities, while the proper noun God references the monotheistic God specifically.
Owing to the essentially arbitrary nature of orthographic classification and the existence of variant authorities and adopted house styles, questionable capitalization of words is not uncommon, even in respected newspapers and magazines.
Most publishers, however, properly require consistency, at least within the same document, in applying their specified standard.
The common meaning of the word or words constituting a proper noun may be unrelated to the object to which the proper noun refers.
For example, someone might be named "Tiger Smith" despite being neither a tiger nor a smith.
For this reason, proper nouns are usually not translated between languages, although they may be transliterated.
For example, the German surname Knödel becomes Knodel or Knoedel in English (not the literal Dumpling).
However, the transcription of place names and the names of monarchs, popes, and non-contemporary authors is common and sometimes universal.
For instance, the Portuguese word Lisboa becomes Lisbon in English; the English London becomes Londres in French; and the Greek Aristotelēs becomes Aristotle in English.
Countable and uncountable nouns
Count nouns are common nouns that can take a plural, can combine with numerals or quantifiers (e.g. "one", "two", "several", "every", "most"), and can take an indefinite article ("a" or "an").
Examples of count nouns are "chair", "nose", and "occasion".
Mass nouns (or non-count nouns) differ from count nouns in precisely that respect: they can't take a plural or combine with number words or quantifiers.
Examples from English include "laughter", "cutlery", "helium", and "furniture".
For example, it is not possible to refer to "a furniture" or "three furnitures".
This is true even though the pieces of furniture comprising "furniture" could be counted.
Thus the distinction between mass and count nouns shouldn't be made in terms of what sorts of things the nouns refer to, but rather in terms of how the nouns present these entities.
Collective nouns
Collective nouns are nouns that refer to groups consisting of more than one individual or entity, even when they are inflected for the singular.
Examples include "committee", "herd", and "school" (of herring).
These nouns have slightly different grammatical properties than other nouns.
For example, the noun phrases that they head can serve as the subject of a collective predicate, even when they are inflected for the singular.
A collective predicate is a predicate that normally can't take a singular subject.
An example of the latter is "talked to each other".
Good: The boys talked to each other.
Bad: *The boy talked to each other.
Good: The committee talked to each other.
Concrete nouns and abstract nouns
Concrete nouns refer to physical bodies which you use at least one of your senses to observe.
For instance, "chair", "apple", or "Janet".
Abstract nouns on the other hand refer to abstract objects, that is ideas or concepts, such as "justice" or "hate".
While this distinction is sometimes useful, the boundary between the two of them is not always clear; consider, for example, the noun "art".
In English, many abstract nouns are formed by adding noun-forming suffixes ("-ness", "-ity", "-tion") to adjectives or verbs.
Examples are "happiness", "circulation" and "serenity".
Nouns and pronouns
Noun phrases can typically be replaced by pronouns, such as "he", "it", "which", and "those", in order to avoid repetition or explicit identification, or for other reasons.
For example, in the sentence "Janet thought that he was weird", the word "he" is a pronoun standing in place of the name of the person in question.
The English word one can replace parts of noun phrases, and it sometimes stands in for a noun.
An example is given below:
John's car is newer than the one that Bill has.
But one can also stand in for bigger subparts of a noun phrase.
For example, in the following example, one can stand in for new car.
This new car is cheaper than that one.
Substantive as a word for "noun"
Starting with old Latin grammars, many European languages use some form of the word substantive as the basic term for noun.
Nouns in the dictionaries of such languages are marked by the abbreviation "s" instead of "n", which may be used for proper nouns instead.
This corresponds to those grammars in which nouns and adjectives phase into each other in more areas than, for example, the English term predicate adjective entails.
In French and Spanish, for example, adjectives frequently act as nouns referring to people who have the characteristics of the adjective.
An example in English is:
The poor you have always with you.
Similarly, an adjective can also be used for a whole group or organization of people:
The Socialist International.
Hence, these words are substantives that are usually adjectives in English.
Ontology (information science)
In both computer science and information science, an ontology is a formal representation of a set of concepts within a domain and the relationships between those concepts.
It is used to reason about the properties of that domain, and may be used to define the domain.
Ontologies are used in artificial intelligence, the Semantic Web, software engineering, biomedical informatics, library science, and information architecture as a form of knowledge representation about the world or some part of it.
Common components of ontologies include:
Individuals: instances or objects (the basic or "ground level" objects)
Classes: sets, collections, concepts or types of objects
Attributes: properties, features, characteristics, or parameters that objects (and classes) can have
Relations: ways that classes and objects can be related to one another
Function terms: complex structures formed from certain relations that can be used in place of an individual term in a statement
Restrictions: formally stated descriptions of what must be true in order for some assertion to be accepted as input
Rules: statements in the form of an if-then (antecedent-consequent) sentence that describe the logical inferences that can be drawn from an assertion in a particular form
Axioms: assertions (including rules) in a logical form that together comprise the overall theory that the ontology describes in its domain of application.
This definition differs from that of "axioms" in generative grammar and formal logic.
In these disciplines, axioms include only statements asserted as a priori knowledge.
As used here, "axioms" also include the theory derived from axiomatic statements.
Events: the changing of attributes or relations
Ontologies are commonly encoded using ontology languages.
Elements
Contemporary ontologies share many structural similarities, regardless of the language in which they are expressed.
As mentioned above, most ontologies describe individuals (instances), classes (concepts), attributes, and relations.
In this section each of these components is discussed in turn.
Individuals
Individuals (instances) are the basic, "ground level" components of an ontology.
The individuals in an ontology may include concrete objects such as people, animals, tables, automobiles, molecules, and planets, as well as abstract individuals such as numbers and words.
Strictly speaking, an ontology need not include any individuals, but one of the general purposes of an ontology is to provide a means of classifying individuals, even if those individuals are not explicitly part of the ontology.
In formal extensional ontologies, only the utterances of words and numbers are considered individuals – the numbers and names themselves are classes.
In a 4D ontology, an individual is identified by its spatio-temporal extent.
Examples of formal extensional ontologies are ISO 15926 and the model in development by the IDEAS Group.
Classes
Classes – concepts that are also called type, sort, category, and kind – are abstract groups, sets, or collections of objects.
They may contain individuals, other classes, or a combination of both.
Some examples of classes:
Person, the class of all people
Vehicle, the class of all vehicles
Car, the class of all cars
Class, representing the class of all classes
Thing, representing the class of all things
Ontologies vary on whether classes can contain other classes, whether a class can belong to itself, whether there is a universal class (that is, a class containing everything), etc.
Sometimes restrictions along these lines are made in order to avoid certain well-known paradoxes.
The classes of an ontology may be extensional or intensional in nature.
A class is extensional if and only if it is characterized solely by its membership.
More precisely, a class C is extensional if and only if for any class C', if C' has exactly the same members as C, then C and C' are identical.
If a class does not satisfy this condition, then it is intensional.
While extensional classes are more well-behaved and well-understood mathematically, as well as less problematic philosophically, they do not permit the fine grained distinctions that ontologies often need to make.
For example, an ontology may want to distinguish between the class of all creatures with a kidney and the class of all creatures with a heart, even if these classes happen to have exactly the same members.
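The difference can be illustrated with a small Python sketch. The extensional classes are modeled as plain sets, which are equal whenever their members are; the intensional classes carry a defining condition as part of their identity. The class name `IntensionalClass`, its fields and the member list are assumptions invented for this example.

```python
from dataclasses import dataclass

# Extensional view: a class is nothing more than its set of members.
creatures = frozenset({"dog", "cat", "human"})     # illustrative members
with_kidney_ext = frozenset(creatures)
with_heart_ext = frozenset(creatures)

# Intensional view: a class is identified by its defining condition,
# not only by the members that happen to satisfy it.
@dataclass(frozen=True)
class IntensionalClass:
    name: str
    condition: str
    members: frozenset

with_kidney_int = IntensionalClass("CreatureWithKidney", "has a kidney", creatures)
with_heart_int = IntensionalClass("CreatureWithHeart", "has a heart", creatures)

print(with_kidney_ext == with_heart_ext)    # same members, so the same class
print(with_kidney_int == with_heart_int)    # same members, but distinct classes
```

Under the extensional view the two classes collapse into one, while the intensional view keeps them apart even though their membership is identical.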
In the upper ontologies mentioned above, the classes are defined intensionally.
Intensionally defined classes usually have necessary conditions associated with membership in each class.
Some classes may also have sufficient conditions, and in those cases the combination of necessary and sufficient conditions makes that class a fully defined class.
Importantly, a class can subsume or be subsumed by other classes; a class subsumed by another is called a subclass of the subsuming class.
For example, Vehicle subsumes Car, since (necessarily) anything that is a member of the latter class is a member of the former.
The subsumption relation is used to create a hierarchy of classes, typically with a maximally general class like Thing at the top, and very specific classes like 2002 Ford Explorer at the bottom.
The critically important consequence of the subsumption relation is the inheritance of properties from the parent (subsuming) class to the child (subsumed) class.
Thus, anything that is necessarily true of a parent class is also necessarily true of all of its subsumed child classes.
In some ontologies, a class is only allowed to have one parent (single inheritance), but in most ontologies, classes are allowed to have any number of parents (multiple inheritance), and in the latter case all necessary properties of each parent are inherited by the subsumed child class.
Thus a particular class of animal (HouseCat) may be a child of the class Cat and also a child of the class Pet.
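Subsumption with multiple inheritance can be sketched as a walk over parent links. The hierarchy below reuses the HouseCat, Cat and Pet classes from the text; the Animal class, the property names and the function names are hypothetical additions for illustration.

```python
# Hypothetical toy hierarchy: each class maps to its direct parents.
PARENTS = {
    "Thing": [],
    "Animal": ["Thing"],
    "Cat": ["Animal"],
    "Pet": ["Thing"],
    "HouseCat": ["Cat", "Pet"],   # multiple inheritance
}

# Necessary properties asserted directly on each class (illustrative only).
PROPERTIES = {
    "Animal": {"is-alive"},
    "Cat": {"has-whiskers"},
    "Pet": {"has-owner"},
}

def ancestors(cls):
    """All classes that subsume `cls`, found by walking up every parent link."""
    found = set()
    stack = list(PARENTS[cls])
    while stack:
        parent = stack.pop()
        if parent not in found:
            found.add(parent)
            stack.extend(PARENTS[parent])
    return found

def subsumes(general, specific):
    """True if every member of `specific` is necessarily a member of `general`."""
    return general == specific or general in ancestors(specific)

def inherited_properties(cls):
    """Properties of `cls` plus everything inherited from its ancestors."""
    props = set(PROPERTIES.get(cls, set()))
    for ancestor in ancestors(cls):
        props |= PROPERTIES.get(ancestor, set())
    return props

print(sorted(ancestors("HouseCat")))
print(sorted(inherited_properties("HouseCat")))
```

Here HouseCat inherits the necessary properties of both of its parents, and, through Cat, those of Animal as well.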
A partition is a set of related classes and associated rules that allow objects to be placed into the appropriate class.
For example, to the right is the partial diagram of an ontology that has a partition of the Car class into the classes 2-Wheel Drive and 4-Wheel Drive.
The partition rule determines if a particular car is placed in the 2-Wheel Drive or the 4-Wheel Drive class.
If the partition rule(s) guarantee that a single Car cannot be in both classes, then the partition is called a disjoint partition.
If the partition rules ensure that every concrete object in the super-class is an instance of at least one of the partition classes, then the partition is called an exhaustive partition.
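The two conditions can be checked mechanically once the members of each class are known. The sketch below assumes that the partition rules have already been applied and that classes are represented as sets of object identifiers; the function names and the car identifiers are invented for illustration.

```python
def is_disjoint(partition):
    """True if no object appears in more than one partition class."""
    seen = set()
    for members in partition.values():
        if seen & members:
            return False
        seen |= members
    return True

def is_exhaustive(superclass_members, partition):
    """True if every object of the super-class is in at least one partition class."""
    covered = set().union(*partition.values())
    return superclass_members <= covered

cars = {"car1", "car2", "car3"}   # hypothetical instances of Car
partition = {
    "2-Wheel Drive": {"car1", "car2"},
    "4-Wheel Drive": {"car3"},
}

print(is_disjoint(partition), is_exhaustive(cars, partition))
```

A partition may satisfy either condition without the other: adding an unclassified car to the super-class breaks exhaustiveness, while placing one car in both classes breaks disjointness.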
Attributes
Objects in the ontology can be described by assigning attributes to them.
Each attribute has at least a name and a value, and is used to store information that is specific to the object it is attached to.
For example the Ford Explorer object has attributes such as:
Name: Ford Explorer
Number-of-doors: 4
Engine: {4.0L, 4.6L}
Transmission: 6-speed
The value of an attribute can be a complex data type; in this example, the value of the attribute called Engine is a list of values, not just a single value.
If you did not define attributes for the concepts you would have either a taxonomy (if hyponym relationships exist between concepts) or a controlled vocabulary.
These are useful, but are not considered true ontologies.
Relationships
An important use of attributes is to describe the relationships (also known as relations) between objects in the ontology.
Typically a relation is an attribute whose value is another object in the ontology.
For example in the ontology that contains the Ford Explorer and the Ford Bronco, the Ford Bronco object might have the following attribute:
Successor: Ford Explorer
This tells us that the Explorer is the model that replaced the Bronco.
Much of the power of ontologies comes from the ability to describe these relations.
Together, the set of relations describes the semantics of the domain.
The most important type of relation is the subsumption relation (is-superclass-of, the converse of is-a, is-subtype-of or is-subclass-of).
This defines which objects are members of classes of objects.
For example, we have already seen that the Ford Explorer is-a 4-wheel drive, which in turn is-a Car.
The addition of the is-a relationships has created a hierarchical taxonomy; a tree-like structure (or, more generally, a partially ordered set) that clearly depicts how objects relate to one another.
In such a structure, each object is the 'child' of a 'parent class' (some languages restrict the is-a relationship to one parent for all nodes, but many do not).
Another common type of relation is the meronymy relation, written as part-of, that represents how objects combine together to form composite objects.
For example, if we extended our example ontology to include objects like Steering Wheel, we would say that "Steering Wheel is-part-of Ford Explorer" since a steering wheel is one of the components of a Ford Explorer.
If we introduce meronymy relationships to our ontology, we find that this simple and elegant tree structure quickly becomes complex and significantly more difficult to interpret manually.
It is not difficult to understand why; an entity that is described as 'part of' another entity might also be 'part of' a third entity.
Consequently, entities may have more than one parent.
The structure that emerges is known as a directed acyclic graph (DAG).
As well as the standard is-a and part-of relations, ontologies often include additional types of relation that further refine the semantics they model.
These relations are often domain-specific and are used to answer particular types of question.
For example in the domain of automobiles, we might define a made-in relationship which tells us where each car is built.
So the Ford Explorer is made-in Louisville.
The ontology may also know that Louisville is-in Kentucky and Kentucky is-a state of the USA.
Software using this ontology could now answer a question like "which cars are made in the U.S.?"
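A minimal sketch of how such a question might be answered is given below, with the relations stored as (subject, relation, object) triples. It simplifies the text slightly by modeling "Kentucky is-a state of the USA" as a plain is-in link; the second car, its location, and the function names are made up for contrast and are not part of the example ontology.

```python
# Facts as (subject, relation, object) triples.
FACTS = {
    ("Ford Explorer", "made-in", "Louisville"),
    ("Louisville", "is-in", "Kentucky"),
    ("Kentucky", "is-in", "USA"),            # simplification of "is-a state of"
    ("Example Car", "made-in", "Example City"),   # hypothetical
    ("Example City", "is-in", "Elsewhere"),       # hypothetical
}

def located_in(place, region):
    """Follow is-in links transitively from `place`, looking for `region`."""
    current = {place}
    visited = set()
    while current:
        if region in current:
            return True
        visited |= current
        current = {o for (s, r, o) in FACTS
                   if r == "is-in" and s in current} - visited
    return False

def cars_made_in(region):
    """All subjects with a made-in link to a place located in `region`."""
    return sorted(s for (s, r, o) in FACTS
                  if r == "made-in" and located_in(o, region))

print(cars_made_in("USA"))
```

The answer is never stated directly in the facts; it follows from chaining the made-in relation with the transitive is-in relation.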
Domain ontologies and upper ontologies
A domain ontology (or domain-specific ontology) models a specific domain, or part of the world.
It represents the particular meanings of terms as they apply to that domain.
For example the word card has many different meanings.
An ontology about the domain of poker would model the "playing card" meaning of the word, while an ontology about the domain of computer hardware would model the "punch card" and "video card" meanings.
An upper ontology (or foundation ontology) is a model of the common objects that are generally applicable across a wide range of domain ontologies.
It contains a core glossary in whose terms objects in a set of domains can be described.
There are several standardized upper ontologies available for use, including Dublin Core, GFO, OpenCyc/ResearchCyc, SUMO, and DOLCE.
WordNet, while considered an upper ontology by some, is not an ontology: it is a unique combination of a taxonomy and a controlled vocabulary (see above, under Attributes).
The Gellish ontology is an example of a combination of an upper and a domain ontology.
Since domain ontologies represent concepts in very specific and often eclectic ways, they are often incompatible.
As systems that rely on domain ontologies expand, they often need to merge domain ontologies into a more general representation.
This presents a challenge to the ontology designer.
Different ontologies in the same domain can also arise due to different perceptions of the domain based on cultural background, education, ideology, or because a different representation language was chosen.
At present, merging ontologies is a largely manual process and therefore time-consuming and expensive.
Using a foundation ontology to provide a common definition of core terms can make this process manageable.
There are studies on generalized techniques for merging ontologies, but this area of research is still largely theoretical.
Ontology languages
An ontology language is a formal language used to encode the ontology.
There are a number of such languages for ontologies, both proprietary and standards-based:
OWL is a language for making ontological statements, developed as a follow-on from RDF and RDFS, as well as earlier ontology language projects including OIL, DAML and DAML+OIL.
OWL is intended to be used over the World Wide Web, and all its elements (classes, properties and individuals) are defined as RDF resources, and identified by URIs.
KIF is a syntax for first-order logic that is based on S-expressions.
The Cyc project has its own ontology language called CycL, based on first-order predicate calculus with some higher-order extensions.
Rule Interchange Format (RIF) and F-Logic combine ontologies and rules.
The Gellish language includes rules for its own extension and thus integrates an ontology with an ontology language.
Relation to the philosophical term
The term ontology has its origin in philosophy, where it is the name of one fundamental branch of metaphysics, concerned with analyzing various types or modes of existence, often with special attention to the relations between particulars and universals, between intrinsic and extrinsic properties, and between essence and existence.
According to Tom Gruber at Stanford University, the meaning of ontology in the context of computer science is “a description of the concepts and relationships that can exist for an agent or a community of agents.”
He goes on to specify that an ontology is generally written, “as a set of definitions of formal vocabulary.”
What ontology has in common in both computer science and philosophy is the representation of entities, ideas, and events, along with their properties and relations, according to a system of categories.
In both fields, one finds considerable work on problems of ontological relativity (e.g. Quine and Kripke in philosophy, Sowa and Guarino in computer science; see Sowa, John F., "Top-level ontological categories", International Journal of Human-Computer Studies, v. 43 (November/December 1995), p. 669-85), and debates concerning whether a normative ontology is viable (e.g. debates over foundationalism in philosophy, debates over the Cyc project in AI).
Differences between the two are largely matters of focus.
Philosophers are less concerned with establishing fixed, controlled vocabularies than are researchers in computer science, while computer scientists are less involved in discussions of first principles (such as debating whether there are such things as fixed essences, or whether entities must be ontologically more primary than processes).
During the second half of the 20th century, philosophers extensively debated the possible methods or approaches to building ontologies, without actually building any very elaborate ontologies themselves.
By contrast, computer scientists were building some large and robust ontologies (such as WordNet and Cyc) with comparatively little debate over how they were built.
In the early years of the 21st century, the interdisciplinary project of cognitive science has been bringing the two circles of scholars closer together.
For example, there is talk of a "computational turn in philosophy" which includes philosophers analyzing the formal ontologies of computer science (sometimes even working directly with the software), while researchers in computer science have been making more references to those philosophers who work on ontology (sometimes with direct consequences for their methods).
Still, many scholars in both fields are uninvolved in this trend of cognitive science, and continue to work independently of one another, pursuing separately their different concerns.
Resources
Examples of published ontologies
Dublin Core, a simple ontology for documents and publishing.
Cyc for formal representation of the universe of discourse.
Suggested Upper Merged Ontology, which is a formal upper ontology
Basic Formal Ontology (BFO), a formal upper ontology designed to support scientific research
Gellish English dictionary, an ontology that includes a dictionary and taxonomy that includes an upper ontology and a lower ontology that focusses on industrial and business applications in engineering, technology and procurement.
Generalized Upper Model, a linguistically-motivated ontology for mediating between client systems and natural language technology
WordNet Lexical reference system
OBO Foundry: a suite of interoperable reference ontologies in biomedicine.
The Ontology for Biomedical Investigations is an open access, integrated ontology for the description of biological and clinical investigations.
COSMO: An OWL ontology that is a merger of the basic elements of the OpenCyc and SUMO ontologies, with additional elements.
Gene Ontology for genomics
PRO, the Protein Ontology of the Protein Information Resource, Georgetown University.
Protein Ontology for proteomics
Foundational Model of Anatomy for human anatomy
SBO, the Systems Biology Ontology, for computational models in biology
Plant Ontology for plant structures and growth/development stages, etc.
CIDOC CRM (Conceptual Reference Model) - an ontology for "cultural heritage information".
GOLD (General Ontology for Linguistic Description)
Linkbase A formal representation of the biomedical domain, founded upon Basic Formal Ontology (BFO).
Foundational, Core and Linguistic Ontologies
ThoughtTreasure ontology
LPL Lawson Pattern Language
TIME-ITEM Topics for Indexing Medical Education
POPE Purdue Ontology for Pharmaceutical Engineering
IDEAS Group A formal ontology for enterprise architecture being developed by the Australian, Canadian, UK and U.S. Defence Depts.
program abstraction taxonomy
SWEET Semantic Web for Earth and Environmental Terminology
CCO The Cell-Cycle Ontology is an application ontology that represents the cell cycle
Ontology libraries
The development of ontologies for the Web has led to the emergence of services providing lists or directories of ontologies with search facilities.
Such directories have been called ontology libraries.
The following are static libraries of human-selected ontologies.
The DAML Ontology Library maintains a legacy of ontologies in DAML.
The Protege Ontology Library contains a set of OWL, frame-based and other format ontologies.
SchemaWeb is a directory of RDF schemata expressed in RDFS, OWL and DAML+OIL.
The following are both directories and search engines.
They include crawlers searching the Web for well-formed ontologies.
Swoogle is a directory and search engine for all RDF resources available on the Web, including ontologies.
The OntoSelect Ontology Library offers similar services for RDF/S, DAML and OWL ontologies.
Ontaria is a "searchable and browsable directory of semantic web data", with a focus on RDF vocabularies with OWL ontologies.
The OBO Foundry / BioPortal is a suite of interoperable reference ontologies in biology and biomedicine.
OpenOffice.org
OpenOffice.org (OO.o or OOo) is a cross-platform office application suite available for a number of different computer operating systems.
It supports the ISO standard OpenDocument Format (ODF) for data interchange as its default file format, as well as the Microsoft Office 97–2003 formats and the Microsoft Office 2007 format (in version 3), among many others.
OpenOffice.org was originally derived from StarOffice, an office suite developed by StarDivision and acquired by Sun Microsystems in August 1999.
The source code of the suite was released in July 2000 with the aim of reducing the dominant market share of Microsoft Office by providing a free, open and high-quality alternative; later versions of StarOffice are based upon OpenOffice.org with additional proprietary components.
OpenOffice.org is free software, available under the GNU Lesser General Public License (LGPL).
The project and software are informally referred to as OpenOffice, but this term is a trademark held by another party, requiring the project to adopt OpenOffice.org as its formal name.
History
Originally developed as the proprietary software application suite StarOffice by the German company StarDivision, the code was purchased in 1999 by Sun Microsystems.
In August 1999 version 5.2 of StarOffice was made available free of charge.
On July 19, 2000, Sun Microsystems announced that it was making the source code of StarOffice available for download under both the LGPL and the Sun Industry Standards Source License (SISSL) with the intention of building an open source development community around the software.
The new project was known as OpenOffice.org, and its website went live on October 13, 2000.
Work on version 2.0 began in early 2003 with the following goals: better interoperability with Microsoft Office; better performance, with improved speed and lower memory usage; greater scripting capabilities; better integration, particularly with GNOME; an easier-to-find and use database front-end for creating reports, forms and queries; a new built-in SQL database; and improved usability.
A beta version was released on March 4, 2005.
On September 2, 2005 Sun announced that it was retiring the SISSL.
As a consequence, the OpenOffice.org Community Council announced that it would no longer dual license the office suite, and future versions would use only the LGPL.
On October 20, 2005, OpenOffice.org 2.0 was formally released to the public.
Eight weeks after the release of Version 2.0, an update, OpenOffice.org 2.0.1, was released.
It fixed minor bugs and introduced new features.
As of the 2.0.3 release, OpenOffice.org changed its release cycle from 18 months to releasing updates, feature enhancements and bug fixes every three months.
Currently, new versions including new features are released every six months (so-called "feature releases"), alternating with so-called "bug fix releases", which are issued between two feature releases (every three months).
StarOffice
Sun subsidizes the development of OpenOffice.org in order to use it as a base for its commercial proprietary StarOffice application software.
Releases of StarOffice since version 6.0 have been based on the OpenOffice.org source code, with some additional proprietary components, including:
Additional bundled fonts (especially East Asian language fonts).
Adabas D database.
Additional document templates.
Clip art.
Sorting functionality for Asian versions.
Additional file filters.
Migration assessment tool (Enterprise Edition).
Macro migration tool (Enterprise Edition).
Configuration management tool (Enterprise Edition).
OpenOffice.org therefore inherited many features from the original StarOffice upon which it was based, including the OpenOffice.org XML file format, which it retained until version 2, when it was replaced by the ISO standard OpenDocument Format (ODF).
Features
According to its mission statement, the OpenOffice.org project aims "To create, as a community, the leading international office suite that will run on all major platforms and provide access to all functionality and data through open-component based APIs and an XML-based file format."
OpenOffice.org aims to compete with Microsoft Office and emulate its look and feel where suitable.
It can read and write most of the file formats found in Microsoft Office and in many other applications, an essential feature of the suite for many users.
OpenOffice.org has been found to be able to open files of older versions of Microsoft Office and damaged files that newer versions of Microsoft Office itself cannot open.
However, it cannot open older Word for Macintosh (MCW) files.
Platforms
Platforms for which OO.o is available include Microsoft Windows, Linux, Solaris, BSD, OpenVMS, OS/2 and IRIX.
The current primary development platforms are Microsoft Windows, Linux and Solaris.
A port for Mac OS X exists for OS X machines which have the X Window System component installed.
A port to OS X's native Aqua user interface is in progress, and is scheduled for completion for the 3.0 milestone.
NeoOffice is an independent fork of OpenOffice, specially adapted for Mac OS X.
Version compatibility
Windows 95: up to v1.1.5
Windows 98-Vista: up to v2.4, development releases of v3.0
Mac OS 10.2: up to v1.1.2
Mac OS 10.3: up to v2.1
Mac OS 10.4-10.5: up to v2.4, development releases of v3.0 (Intel only)
OS/2 and eComStation: up to v2.0.4
Components
OpenOffice.org is a collection of applications that work together closely to provide the features expected from a modern office suite.
Many of the components are designed to mirror those available in Microsoft Office.
The components available include:
QuickStarter
A small program for Windows and Linux that runs when the computer starts.
It loads the core files and libraries for OpenOffice.org during computer startup and allows the suite applications to start more quickly when selected later.
The amount of time it takes to open OpenOffice.org applications was a common complaint in version 1.0 of the suite.
Substantial improvements were made in this area for version 2.2.
The macro recorder
Used to record user actions and replay them later to help with automating tasks, using OpenOffice.org Basic (see below).
It is not possible to download these components individually on Windows, though they can be installed separately.
Most Linux distributions break the components into individual packages which may be downloaded and installed separately.
OpenOffice.org Basic
OpenOffice.org Basic is a programming language, based on StarOffice Basic, that is similar to Microsoft Visual Basic for Applications (VBA).
In addition to the macros, the upcoming Novell edition of OpenOffice.org 2.0 supports running Microsoft VBA macros, a feature expected to be incorporated into the mainstream version soon.
OpenOffice.org Basic is available in the Writer and Calc applications.
It is written in functions called subroutines or macros, with each macro performing a different task, such as counting the words in a paragraph.
OpenOffice.org Basic is especially useful in doing repetitive tasks that have not been integrated in the program.
As the OpenOffice.org database, called "Base", uses documents created under the Writer application for reports and forms, one could say that Base can also be programmed with OpenOffice.org Basic.
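To give a feel for the kind of small task such a macro performs, here is the word-counting example as a minimal Python sketch. It is not OpenOffice.org Basic and does not use the OpenOffice.org API; the function name and the whitespace-based definition of a "word" are assumptions made for illustration.

```python
def count_words(paragraph):
    """Count whitespace-separated words in a paragraph of plain text."""
    return len(paragraph.split())


sample = "OpenOffice.org Basic is especially useful in doing repetitive tasks"
print(count_words(sample))
```

A real macro would obtain the paragraph text from the open document rather than from a string literal.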
File formats
OpenOffice.org pioneered the ISO/IEC standard OpenDocument file formats (ODF), which it uses natively, by default.
It also supports reading (and in some cases writing) a large number of legacy proprietary file formats (e.g. WordPerfect through libwpd, StarOffice, Lotus software, MS Works through libwps, Rich Text Format), most notably the Microsoft Office formats. The OpenDocument specification was "approved for release as an ISO and IEC International Standard" under the name ISO/IEC 26300:2006.
Microsoft Office interoperability
In response to Microsoft's recent movement towards using the Office Open XML format in Microsoft Office 2007, Novell has released an Office Open XML converter for OOo under a liberal BSD license (along with GNU GPL and LGPL licensed libraries), that will be submitted for inclusion into the OpenOffice.org project.
This allows OOo to read and write Office Open XML word processing documents (.docx).
Currently it works only with the latest Novell edition of OpenOffice.org.
Sun Microsystems has developed an ODF plugin for Microsoft Office which enables users of Microsoft Office Word, Excel and PowerPoint to read and write ODF documents.
The plugin currently works with Microsoft Office 2003, Microsoft Office XP and Microsoft Office 2000.
Support for Microsoft Office 2007 is only available in combination with Microsoft Office 2007 SP1.
Several software companies (including Microsoft and Novell) are working on an add-in for Microsoft Office that allows reading and writing ODF files.
Currently it works only for Microsoft Word 2007 / XP / 2003.
Microsoft provides a compatibility pack to read and write Office Open XML files with Office 2000, XP and 2003.
The compatibility pack can also be used as a stand-alone converter with Microsoft Office 97.
This might be helpful to convert older Microsoft Office files via Office Open XML to ODF if a direct conversion doesn't work as expected.
The Office compatibility pack, however, does not install for Office 2000 or Office XP on Windows 9x.
Note that some office applications built with Microsoft components may refuse to import OpenOffice data.
The Sage Group's Simply Accounting, for example, can import Excel's .xls files, but refuses to accept OpenOffice.org-generated .xls files for the reason that the OOo .xls files are not "genuine Microsoft" .xls files.
Development
Overview
The OpenOffice.org API is based on a component technology known as Universal Network Objects (UNO).
It consists of a wide range of interfaces defined in a CORBA-like interface description language.
The document file format used is based on XML and several export and import filters.
All external formats read by OpenOffice.org are converted back and forth from an internal XML representation.
By using compression when saving XML to disk, files are generally smaller than the equivalent binary Microsoft Office documents.
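The effect of compressing XML can be illustrated with a minimal Python sketch using the standard zlib module. The XML fragment below is made up for the example and is not the actual OpenOffice.org schema; the sketch only shows that repetitive XML markup compresses well, and does not compare against any Microsoft Office file.

```python
import zlib

# A made-up XML fragment; it is NOT the real OpenOffice.org schema.
paragraph = "<p style='Standard'>The quick brown fox jumps over the lazy dog.</p>\n"
xml_text = "<document>\n" + paragraph * 200 + "</document>\n"

raw = xml_text.encode("utf-8")
compressed = zlib.compress(raw)

print(len(raw), "bytes uncompressed")
print(len(compressed), "bytes compressed")

# Decompression restores the original bytes exactly (the compression is lossless).
restored = zlib.decompress(compressed)
```

Because XML repeats tag and attribute names many times, the compressed form is typically a small fraction of the raw text.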
The native file format for storing documents in version 1.0 was used as the basis of the OASIS OpenDocument file format standard, which has become the default file format in version 2.0.
Development versions of the suite are released every few weeks on the developer zone of the OpenOffice.org website.
The releases are meant for those who wish to test new features or are simply curious about forthcoming changes; they are not suitable for production use.
Native desktop integration
OpenOffice.org 1.0 was criticized for not having the look and feel of applications developed natively for the platforms on which it runs.
Starting with version 2.0, OpenOffice.org uses the native widget toolkit, icons, and font-rendering libraries on a variety of platforms, to better match native applications and provide a smoother experience for the user.
There are projects underway to further improve this integration on both GNOME and KDE.
This issue has been particularly pronounced on Mac OS X, whose standard user interface looks noticeably different from either Windows or X11-based desktop environments and requires the use of programming toolkits unfamiliar to most OpenOffice.org developers.
There are two implementations of OpenOffice.org available for OS X:
OpenOffice.org Mac OS X (X11):
This official implementation requires the installation of X11.app or XDarwin, and is a close port of the well-tested Unix version.
It is functionally equivalent to the Unix version, and its user interface resembles the look and feel of that version; for example, the application uses its own menu bar instead of the OS X menu at the top of the screen.
It also requires system fonts to be converted to X11 format for OpenOffice.org to use them (which can be done during application installation).
OpenOffice.org Aqua:
After a completed first step using Carbon, the OpenOffice.org Aqua port switched to Cocoa; this Cocoa-based Aqua version is being developed under the aegis of OpenOffice.org, and a beta version is currently available.
Sun Microsystems is collaborating with OOo to further development of the Aqua version of OpenOffice.org for Mac.
Future
Currently, a developer preview of OpenOffice.org 3 (OOo-dev 3.0) is available for download.
Among the planned features for OOo 3.0, set to be released by September 2008, are:
Personal Information Manager (PIM), probably based on Thunderbird/Lightning
PDF import into Draw (to maintain correct layout of the original PDF)
OOXML document support for opening documents created in Office 2007
Support for Mac OS X Aqua platform
Extensions, to add third party functionality.
Presenter screen in Impress with multi-screen support
Other projects
A number of products are derived from OpenOffice.org.
Among the more well-known ones are Sun StarOffice and NeoOffice.
The OpenOffice.org site also lists a large variety of complementary products including groupware solutions.
NeoOffice
NeoOffice is an independent port that integrates with OS X’s Aqua user interface using Java, Carbon and (increasingly) Cocoa toolkits.
NeoOffice adheres fairly closely to OS X UI standards (for example, using native pull-down menus), and has direct access to OS X’s installed fonts and printers.
Its releases lag behind the official OpenOffice.org X11 releases, due to its small development team and the concurrent development of the technology used to port the user interface.
Other projects run alongside the main OpenOffice.org project and are easier to contribute to.
These include documentation, internationalisation and localisation and the API.
OpenGroupware.org
OpenGroupware.org is a set of extension programs to allow the sharing of OpenOffice.org documents, calendars, address books, e-mails, instant messaging and blackboards, and provide access to other groupware applications.
There is also an effort to create and share assorted document templates and other useful additions at OOExtras.
A set of Perl extensions is available through the CPAN in order to allow OpenOffice.org document processing by external programs.
These libraries do not use the OpenOffice.org API.
They directly read or write the OpenOffice.org files using Perl standard file compression/decompression, XML access and UTF-8 encoding modules.
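The same direct approach (decompress the package, then parse the XML inside it) can be sketched in Python instead of Perl. The sketch below builds a tiny in-memory ZIP package with a single content.xml and reads it back; the element names are simplified placeholders rather than the real document schema, and both function names are hypothetical.

```python
import io
import xml.etree.ElementTree as ET
import zipfile


def build_package(text):
    """Build a tiny in-memory ZIP package holding one simplified content.xml."""
    root = ET.Element("document")  # simplified placeholder, not the real schema
    ET.SubElement(root, "p").text = text
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        package.writestr("content.xml", ET.tostring(root, encoding="utf-8"))
    return buffer.getvalue()


def read_paragraphs(package_bytes):
    """Decompress content.xml from the package and return its paragraph texts."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        xml_bytes = package.read("content.xml")
    root = ET.fromstring(xml_bytes)
    return [p.text for p in root.iter("p")]


print(read_paragraphs(build_package("Hello, world")))
```

Real documents use XML namespaces and several files per package, so a practical reader would need namespace-aware lookups; the round trip here only shows the compression, XML access and UTF-8 steps.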
Portable
A distribution of OpenOffice.org called OpenOffice.org Portable is designed to run the suite from a USB flash drive.
OxygenOffice Professional
An enhancement of OpenOffice.org (current version: 2.4), providing:
Possibility to run Visual Basic for Applications (VBA) macros in Calc (for testing)
Improved Calc HTML export
Enhanced Access support for Base
Security fixes
Enhanced performance
Enhanced color-palette
Enhanced help menu, additional User’s Manual, and extended tips for beginners
Optionally it provides, free for personal and professional use:
More than 3,200 graphics, both clip art and photos.
Several templates and sample documents
Over 90 free fonts.
Additional tools like OOoWikipedia
Extensions
Since version 2.0.4, OpenOffice.org has supported extensions in a similar manner to Mozilla Firefox.
Extensions make it easy to add new functionality to an existing OpenOffice.org installation.
The OpenOffice.org Extension Repository already lists more than 80 extensions.
Developers can easily build new extensions for OpenOffice.org, for example by using the OpenOffice.org API Plugin for NetBeans.
The OpenOffice.org Bibliographic Project
This aims to incorporate powerful reference management software into the suite.
The new major addition is slated for inclusion in the standard OpenOffice.org release in late 2007 to mid-2008, or possibly later depending upon the availability of programmers.
Security
OpenOffice.org includes a security team, and as of June 2008 the security organization Secunia reports no known unpatched security flaws for the software.
Kaspersky Lab has shown a proof of concept virus for OpenOffice.org.
This shows OOo viruses are possible, but there is no known virus "in the wild".
In a private meeting of the French Ministry of Defense, macro-related security issues were raised.
OpenOffice.org developers have responded and noted that the supposed vulnerability had not been announced through "well defined procedures" for disclosure and that the ministry had revealed nothing specific.
However, the developers have been in talks with the researcher concerning the supposed vulnerability.
As with Microsoft Word, documents created in OpenOffice can contain metadata which may include a complete history of what was changed, when and by whom.
Ownership
The project and software are informally referred to as OpenOffice, but project organizers report that this term is a trademark held by another party, requiring them to adopt OpenOffice.org as the formal name.
(Due to a similar trademark issue, the Brazilian Portuguese version of the suite is distributed under the name BrOffice.org.)
Development is managed by staff members of StarOffice.
Some delay and difficulty in implementing external contributions to the core codebase (even those from the project's corporate sponsors) has been noted.
Currently, there are several derived and/or proprietary works based on OOo, with some of them being:
Sun Microsystems' StarOffice, with various complementary add-ons.
IBM's Lotus Symphony, with a new interface based on Eclipse (based on OO.o 1.x).
OpenOffice.org Novell edition, integrated with Evolution and with an OOXML filter.
Beijing Redflag Chinese 2000's RedOffice, fully localized in Chinese characters.
Planamesa's NeoOffice for Mac OS X with Aqua support via Java.
On May 23, 2007, the OpenOffice.org community and Redflag Chinese 2000 Software Co, Ltd. announced a joint development effort focused on integrating the new features that have been added in the RedOffice localization of OpenOffice.org, as well as quality assurance and work on the core applications.
Additionally, Redflag Chinese 2000 made public its commitment to the global OO.o community stating it would "strengthen its support of the development of the world's leading free and open source productivity suite", adding around 50 engineers (that have been working on RedOffice since 2006) to the project.
On September 10, 2007, the OO.o community announced that IBM had joined to support the development of OpenOffice.org.
"IBM will be making initial code contributions that it has been developing as part of its Lotus Notes product, including accessibility enhancements, and will be making ongoing contributions to the feature richness and code quality of OpenOffice.org.
Besides working with the community on the free productivity suite's software, IBM will also leverage OpenOffice.org technology in its products" as has been seen with Lotus Symphony.
Sean Poulley, the vice president of business and strategy in IBM's Lotus Software division said that IBM plans to take a leadership role in the OpenOffice.org community together with other companies such as Sun Microsystems.
IBM will work within the leadership structure that exists.
On October 2, 2007, Michael Meeks announced (prompting a response from Sun's Simon Phipps and Mathias Bauer) a derived OpenOffice.org work, under the wing of his employer Novell, with the purpose of including new features and fixes that do not get easily integrated in the OOo-build up-stream core.
The work is called Go-OO (http://go-oo.org/), a name under which alternative OO.o software has been available for five years.
The new features are shared with Novell's edition of OOo and include:
VBA macros support.
Faster start up time.
"A linear optimization solver to optimize a cell value based on arbitrary constraints built into Calc".
Support for multimedia content in documents, using the GStreamer multimedia framework.
Import support for Microsoft Works formats, WordPerfect graphics (WPG format) and T602 files.
Details about the patch handling including metrics can be found on the OpenOffice.org site.
Reactions
An issue of Federal Computer Week listed OpenOffice.org as one of the "5 stars of open-source products."
In contrast, OpenOffice.org was used in 2005 by The Guardian newspaper to illustrate what it claims are the limitations of open-source software, although the article does finish by stating that the software may be better than MS Word for books.
Market share
It is extremely difficult to estimate the market share of OpenOffice.org due to the fact that OpenOffice.org can be freely distributed via download sites including mirrors, peer-to-peer networks, CDs, Linux distros, etc.
Nevertheless, the OpenOffice.org project tries to capture key adoption data in a market share analysis.
Although Microsoft Office retains 95% of the general market as measured by revenue, OpenOffice.org and StarOffice have secured 14% of the large enterprise market as of 2004 and 19% of the small to midsize business market in 2005.
The OpenOffice.org web site reports more than 98 million downloads.
Other large scale users of OpenOffice.org include Singapore’s Ministry of Defence, and Bristol City Council in the UK.
In France, OpenOffice.org has attracted the attention of both local and national government administrations who wish to rationalize their software procurement, as well as have stable, standard file formats for archival purposes.
It is now the official office suite for the French Gendarmerie.
Several government organizations in India that use Linux, such as IIT Bombay (a renowned technical institute), the Supreme Court of India and the Allahabad High Court, rely completely on OpenOffice.org for their administration.
On October 4, 2005, Sun and Google announced a strategic partnership.
As part of this agreement, Sun will add a Google search bar to OpenOffice.org, Sun and Google will engage in joint marketing activities as well as joint research and development, and Google will help distribute OpenOffice.org.
Google is currently distributing StarOffice as part of the Google Pack.
Besides StarOffice, there are still a number of OpenOffice.org derived commercial products.
Most of them are developed under the SISSL license (which is valid up to OpenOffice.org 2.0 Beta 2).
In general they are targeted at local or niche markets, with proprietary add-ons such as speech recognition modules, automatic database connection, or better CJK support.
In July 2007 Everex, a division of First International Computer and the 9th largest PC supplier in the U.S., began shipping systems preloaded with OpenOffice.org 2.2 into Wal-Mart and Sam's Club throughout North America.
In September 2007 IBM announced that it would supply and support OpenOffice.org branded as Lotus Symphony, and integrated into Lotus Notes.
IBM also announced 35 developers would be assigned to work on OpenOffice.org, and that it would join the OpenOffice.org foundation.
Commentators noted parallels between IBM's 2000 support of Linux and this announcement.
Java controversy
In the past, OpenOffice.org was criticized for an increasing dependency on the Java Runtime Environment, which was not free software.
That Sun Microsystems is both the creator of Java and the chief supporter of OpenOffice.org drew accusations of ulterior motives for this technology choice.
Version 1 depended on the Java Runtime Environment (JRE) being present on the user’s computer for some auxiliary functions, but version 2 increased the suite’s use of Java, requiring a JRE.
In response, Red Hat increased their efforts to improve free Java implementations.
Red Hat’s Fedora Core 4 (released on June 13, 2005) included a beta version of OpenOffice.org version 2, running on GCJ and GNU Classpath.
The issue of OpenOffice.org’s use of Java came to the fore in May 2005, when Richard Stallman appeared to call for a fork of the application in a posting on the Free Software Foundation website.
This led to discussions within the OpenOffice.org community and between Sun staff and developers involved in GNU Classpath, a free replacement for Sun’s Java implementation.
Later that year, the OpenOffice.org developers also placed into their development guidelines various requirements to ensure that future versions of OpenOffice.org could be run on free implementations of Java and fixed the issues which previously prevented OpenOffice.org 2.0 from using free software Java implementations.
On November 13, 2006, Sun committed to releasing Java under the GNU General Public License in the near future.
This process would end OpenOffice.org's dependence on non-free software.
Between November 2006 and May 2007, Sun Microsystems made available most of their Java technologies under the GNU General Public License, in compliance with the specifications of the Java Community Process, thus making almost all of Sun's Java also free software.
The following areas of OpenOffice.org 2.0 depend on the JRE being present:
The media player on Unix-like systems
All document wizards in Writer
Accessibility tools
Report Autopilot
JDBC driver support
HSQL database engine, which is used in OpenOffice.org Base
XSLT filters
BeanShell, the NetBeans scripting language and the Java UNO bridge
Export filters to the Aportis.doc (.pdb) format for the Palm OS or Pocket Word (.psw) format for the Pocket PC
Export filter to LaTeX
Export filter to MediaWiki's wikitext
A common point of confusion is that mail merge to generate emails requires the Java API JavaMail in StarOffice; however, as of version 2.0.1, OpenOffice.org uses a Python component instead.
Complementary software
OpenOffice.org provides replacements for MS Office's Microsoft Word, Microsoft Excel, Microsoft PowerPoint, Microsoft Access, Microsoft Equation Editor and Microsoft Visio.
To match the functionality of the rest of MS Office, OOo can be complemented with other open source programs such as:
Evolution or Thunderbird/Lightning for a PIM like Microsoft Outlook.
OpenProj (which seeks integration with OOo, but might be limited due to licensing issues) for Microsoft Project.
Scribus for Microsoft Publisher
O3spaces for Sharepoint
Microsoft also provides Administrative Template Files ("adm files") that allow MS Office to be configured using Windows Group Policy.
Equivalent functionality for OpenOffice.org is provided by OpenOffice-Enterprise, a commercial product from Open Office Technology, Inc.
Issues
OpenOffice.org has been criticized for slow start times and extensive CPU and RAM usage in comparison to competing software such as Microsoft Office.
Tests comparing OpenOffice.org 2.2 with Microsoft Office 2007 found that OpenOffice.org took approximately 2 times the processing time and memory to load itself along with a blank file, and approximately 4.7 times the processing time and 3.9 times the memory to open an extremely large spreadsheet file.
Critics have pointed to excessive code bloat and OpenOffice.org's loading of the Java Runtime Environment as possible reasons for the slow speeds and excessive memory usage.
However, since OpenOffice.org 2.2 the performance of OpenOffice.org has been improved dramatically.
One of the greatest challenges is achieving true cross-compatibility with other applications.
Since OpenOffice.org is forced to reverse engineer proprietary binary formats due to the unavailability of open specifications, slight formatting incompatibilities tend to exist when files are saved in a non-native format.
For example, a complex .doc document formatted under OpenOffice.org is usually not displayed with the correct format when opened with Microsoft Office.
Retail
The free software license under which OpenOffice.org is distributed allows unlimited use of the software for both home and business use, including unlimited redistribution of the software.
Several businesses sell the OpenOffice.org suite on auction websites such as eBay, offering value-added services such as 24/7 technical support, download mirrors, and CD mailing.
However, often the 24/7 support offered is not provided by the company selling the software, but rather by the official OpenOffice.org mailing list.
Parsing
In computer science and linguistics, parsing, or, more formally, syntactic analysis, is the process of analyzing a sequence of tokens to determine grammatical structure with respect to a given (more or less) formal grammar.
A parser is thus one of the components in an interpreter or compiler, where it captures the implied hierarchy of the input text and transforms it into a form suitable for further processing (often some kind of parse tree, abstract syntax tree or other hierarchical structure) and normally checks for syntax errors at the same time.
The parser often uses a separate lexical analyser to create tokens from the sequence of input characters.
Parsers may be programmed by hand or may be semi-automatically generated (in some programming language) by a tool (such as Yacc) from a grammar written in Backus-Naur form.
Parsing is also an earlier term for the diagramming of sentences of natural languages, and is still used for the diagramming of inflected languages, such as the Romance languages or Latin.
Parsers can also be constructed as executable specifications of grammars in functional programming languages.
Frost, Hafiz and Callaghan have built on the work of others to construct a set of higher-order functions (called parser combinators) which allow top-down parsers with polynomial time and space complexity to be constructed as executable specifications of ambiguous grammars containing left-recursive productions.
The X-SAIGA site has more about the algorithms and implementation details.
Human languages
In some machine translation and natural language processing systems, human languages are parsed by computer programs.
Human sentences are not easily parsed by programs, as there is substantial ambiguity in the structure of human language.
In order to parse natural language data, researchers must first agree on the grammar to be used.
The choice of syntax is affected by both linguistic and computational concerns; for instance some parsing systems use lexical functional grammar, but in general, parsing for grammars of this type is known to be NP-complete.
Head-driven phrase structure grammar is another linguistic formalism which has been popular in the parsing community, but other research efforts have focused on less complex formalisms such as the one used in the Penn Treebank.
Shallow parsing aims to find only the boundaries of major constituents such as noun phrases.
Another popular strategy for avoiding linguistic controversy is dependency grammar parsing.
Most modern parsers are at least partly statistical; that is, they rely on a corpus of training data which has already been annotated (parsed by hand).
This approach allows the system to gather information about the frequency with which various constructions occur in specific contexts.
(See machine learning.)
Approaches which have been used include straightforward PCFGs (probabilistic context free grammars), maximum entropy, and neural nets.
Most of the more successful systems use lexical statistics (that is, they consider the identities of the words involved, as well as their part of speech).
However such systems are vulnerable to overfitting and require some kind of smoothing to be effective.
Parsing algorithms for natural language cannot rely on the grammar having 'nice' properties as with manually-designed grammars for programming languages.
As mentioned earlier some grammar formalisms are very computationally difficult to parse; in general, even if the desired structure is not context-free, some kind of context-free approximation to the grammar is used to perform a first pass.
Algorithms which use context-free grammars often rely on some variant of the CKY algorithm, usually with some heuristic to prune away unlikely analyses to save time.
(See chart parsing.)
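As a rough illustration of the chart-based approach, the following is a minimal sketch of a CKY recognizer for a grammar in Chomsky normal form. The toy grammar (`LEXICAL`, `BINARY`) and the function name `cky_recognize` are assumptions made for this example; the sketch has no probabilities and no pruning heuristic, both of which practical natural language parsers would add.

```python
from collections import defaultdict

# Hypothetical toy grammar in Chomsky normal form (not from the text).
# LEXICAL maps a word to the nonterminals that can produce it.
LEXICAL = {"the": {"Det"}, "dog": {"N"}, "cat": {"N"}, "saw": {"V"}}
# BINARY maps a pair (B, C) to the nonterminals A with a rule A -> B C.
BINARY = {("Det", "N"): {"NP"}, ("V", "NP"): {"VP"}, ("NP", "VP"): {"S"}}

def cky_recognize(words, lexical, binary, start="S"):
    """Return True if the grammar derives the word sequence from `start`."""
    n = len(words)
    if n == 0:
        return False
    # chart[i, j] holds every nonterminal that derives words[i:j].
    chart = defaultdict(set)
    for i, word in enumerate(words):
        chart[i, i + 1] = set(lexical.get(word, ()))
    # Fill in longer spans from shorter ones.
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):  # split point between the two halves
                for left in chart[i, k]:
                    for right in chart[k, j]:
                        chart[i, j] |= binary.get((left, right), set())
    return start in chart[0, n]
```

The three nested loops over span, start position and split point give the cubic running time in sentence length that makes pruning attractive for long sentences.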
However, some systems trade accuracy for speed using, e.g., linear-time versions of the shift-reduce algorithm.
A somewhat recent development has been parse reranking in which the parser proposes some large number of analyses, and a more complex system selects the best option.
The resulting analysis is normally a branching structure of one part and its subparts.
Programming languages
The most common use of a parser is as a component of a compiler or interpreter.
This parses the source code of a computer programming language to create some form of internal representation.
Programming languages tend to be specified in terms of a context-free grammar because fast and efficient parsers can be written for them.
Parsers are written by hand or generated by parser generators.
Context-free grammars are limited in the extent to which they can express all of the requirements of a language.
Informally, the reason is that the memory of such a language is limited.
The grammar cannot remember the presence of a construct over an arbitrarily long input; this is necessary for a language in which, for example, a name must be declared before it may be referenced.
More powerful grammars that can express this constraint, however, cannot be parsed efficiently.
Thus, it is a common strategy to create a relaxed parser for a context-free grammar which accepts a superset of the desired language constructs (that is, it accepts some invalid constructs); later, the unwanted constructs can be filtered out.
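The two-pass strategy can be sketched as follows, assuming a hypothetical toy language whose only statements are `declare NAME` and `use NAME`. The first pass accepts any well-formed statement sequence; a separate pass then rejects uses that have no earlier declaration. The statement syntax and the function names are illustrative assumptions, not part of any real language.

```python
import re

# Hypothetical statement syntax: "declare NAME" or "use NAME".
STATEMENT = re.compile(r"^(declare|use) ([A-Za-z_]\w*)$")

def parse(lines):
    """Relaxed pass: accept any sequence of well-formed statements."""
    program = []
    for line in lines:
        match = STATEMENT.match(line.strip())
        if match is None:
            raise SyntaxError(f"malformed statement: {line!r}")
        program.append((match.group(1), match.group(2)))
    return program

def check_declarations(program):
    """Filtering pass: list (index, name) for each use before declaration."""
    declared = set()
    errors = []
    for index, (kind, name) in enumerate(program):
        if kind == "declare":
            declared.add(name)
        elif name not in declared:
            errors.append((index, name))
    return errors
```

Here `parse` succeeds on `["declare x", "use x", "use y"]` even though `y` is never declared; only `check_declarations` reports the problem.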
Overview of process
[Figure: Flow of data in a typical parser]
The following example demonstrates the common case of parsing a computer language with two levels of grammar: lexical and syntactic.
The first stage is the token generation, or lexical analysis, by which the input character stream is split into meaningful symbols defined by a grammar of regular expressions.
For example, a calculator program would look at an input such as "12*(3+4)^2" and split it into the tokens 12, *, (, 3, +, 4, ), ^, and 2, each of which is a meaningful symbol in the context of an arithmetic expression.
The parser would contain rules to tell it that the characters *, +, ^, ( and ) mark the start of a new token, so meaningless tokens like "12*" or "(3" will not be generated.
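A minimal sketch of such a lexical analyser for calculator input is shown below. The token names (`NUMBER`, `OP`, `SKIP`), the exact set of operator characters and the function name `tokenize` are assumptions chosen for illustration.

```python
import re

# Token grammar as regular expressions (illustrative choices):
# a number is a run of digits; each operator or parenthesis is its own token.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("OP", r"[*+^()/-]"),
    ("SKIP", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(text):
    """Split an arithmetic expression into (kind, text) tokens."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        if match.lastgroup != "SKIP":  # whitespace separates tokens but is dropped
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens
```

Because an operator character always ends the current run of digits, an input such as `12*(3+4)^2` comes out as nine separate tokens rather than fragments that mix digits and operators.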
The next stage is parsing or syntactic analysis, which is checking that the tokens form an allowable expression.
This is usually done with reference to a context-free grammar which recursively defines components that can make up an expression and the order in which they must appear.
However, not all rules defining programming languages can be expressed by context-free grammars alone, for example type validity and proper declaration of identifiers.
These rules can be formally expressed with attribute grammars.
The final phase is semantic parsing or analysis, which is working out the implications of the expression just validated and taking the appropriate action.
In the case of a calculator or interpreter, the action is to evaluate the expression or program; a compiler, on the other hand, would generate some kind of code.
Attribute grammars can also be used to define these actions.
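The syntactic and semantic phases of a calculator can be sketched together as a small recursive-descent evaluator. The grammar used here (`expr`, `term`, `power`, `atom`), the right-associative `^` operator and all names are assumptions made for the example; the regular-expression tokenizer is deliberately crude and silently skips characters it does not recognise.

```python
import re

def tokenize(text):
    # Crude lexer for the sketch: digit runs and single-character operators.
    return re.findall(r"\d+|[-+*/^()]", text)

class Parser:
    """Assumed grammar:
    expr  := term (('+' | '-') term)*
    term  := power (('*' | '/') power)*
    power := atom ('^' power)?
    atom  := NUMBER | '(' expr ')'
    """

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self, expected=None):
        tok = self.peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"expected {expected or 'a token'}, got {tok!r}")
        self.pos += 1
        return tok

    def expr(self):
        value = self.term()
        while self.peek() in ("+", "-"):
            if self.take() == "+":
                value += self.term()
            else:
                value -= self.term()
        return value

    def term(self):
        value = self.power()
        while self.peek() in ("*", "/"):
            if self.take() == "*":
                value *= self.power()
            else:
                value /= self.power()
        return value

    def power(self):
        base = self.atom()
        if self.peek() == "^":
            self.take()
            return base ** self.power()  # right-associative exponent
        return base

    def atom(self):
        tok = self.take()
        if tok == "(":
            value = self.expr()
            self.take(")")
            return value
        if tok.isdigit():
            return int(tok)
        raise SyntaxError(f"unexpected token {tok!r}")

def evaluate(text):
    """Check that the tokens form an allowable expression and compute its value."""
    parser = Parser(tokenize(text))
    value = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected trailing token {parser.peek()!r}")
    return value
```

The semantic action (arithmetic) is performed as each rule is recognised, which is the interpreter case described above; a compiler would emit code at the same points instead.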
Types of parsers
The task of the parser is essentially to determine if and how the input can be derived from the start symbol of the grammar.
This can be done in essentially two ways:
Top-down parsing - Top-down parsing can be viewed as an attempt to find left-most derivations of an input-stream by searching for parse-trees using a top-down expansion of the given formal grammar rules.
Tokens are consumed from left to right.
Inclusive choice is used to accommodate ambiguity by expanding all alternative right-hand sides of grammar rules.
LL parsers and recursive-descent parsers are examples of top-down parsers which cannot accommodate left-recursive productions.
Although it has been believed that simple implementations of top-down parsing cannot accommodate direct and indirect left-recursion and may require exponential time and space complexity while parsing ambiguous context-free grammars, a more sophisticated algorithm for top-down parsing has been created by Frost, Hafiz, and Callaghan which accommodates ambiguity and left recursion in polynomial time and which generates polynomial-size representations of the potentially exponential number of parse trees.
Their algorithm is able to produce both left-most and right-most derivations of an input with respect to a given CFG.
Bottom-up parsing - A parser can start with the input and attempt to rewrite it to the start symbol.
Intuitively, the parser attempts to locate the most basic elements, then the elements containing these, and so on.
LR parsers are examples of bottom-up parsers.
Another term used for this type of parsing is shift-reduce parsing.
Another important distinction is whether the parser generates a leftmost derivation or a rightmost derivation (see context-free grammar).
LL parsers will generate a leftmost derivation and LR parsers will generate a rightmost derivation (although usually in reverse).
Examples of parsers
Top-down parsers
Some of the parsers that use top-down parsing include:
Recursive descent parser
LL parser (Left-to-right, Leftmost derivation)
X-SAIGA - eXecutable SpecificAtIons of GrAmmars.
The site contains publications related to a top-down parsing algorithm that supports left-recursion and ambiguity in polynomial time and space.
Bottom-up parsers
Some of the parsers that use bottom-up parsing include:
Precedence parser
Operator-precedence parser
Simple precedence parser
BC (bounded context) parsing
LR parser (Left-to-right, Rightmost derivation)
Simple LR (SLR) parser
LALR parser
Canonical LR (LR(1)) parser
GLR parser
CYK parser
Lexical category
In grammar, a lexical category (also word class, lexical class, or in traditional grammar part of speech) is a linguistic category of words (or more precisely lexical items), which is generally defined by the syntactic or morphological behaviour of the lexical item in question.
Common linguistic categories include noun and verb, among others.
There are open word classes, which constantly acquire new members, and closed word classes, which acquire new members infrequently if at all.
Different languages may have different lexical categories, or they might associate different properties to the same one.
For example, Japanese has at least three classes of adjectives where English has one; Chinese and Japanese have measure words while European languages have nothing resembling them; and many languages do not distinguish between adjectives and adverbs, or between adjectives and nouns.
Many linguists argue that the formal distinctions between parts of speech must be made within the framework of a specific language or language family, and should not be carried over to other languages or language families.
History
The classification of words into lexical categories is found from the earliest moments in the history of linguistics.
In the Nirukta, written in the 5th or 6th century BCE, the Sanskrit grammarian Yāska defined four main categories of words:
nāma - nouns or substantives
ākhyāta - verbs
upasarga - pre-verbs or prefixes
nipāta - particles, invariant words (perhaps prepositions)
These four were grouped into two large classes: inflected (nouns and verbs) and uninflected (pre-verbs and particles).
A century or two later, the Greek scholar Plato wrote in the Cratylus dialog that "... sentences are, I conceive, a combination of verbs [rhēma] and nouns [ónoma]".
Another class, "conjunctions" (covering conjunctions, pronouns, and the article), was later added by Aristotle.
By the end of the 2nd century BCE, the classification scheme had been expanded into eight categories, seen in the Tékhnē grammatiké:
Noun: a part of speech inflected for case, signifying a concrete or abstract entity
Verb: a part of speech without case inflection, but inflected for tense, person and number, signifying an activity or process performed or undergone
Participle: a part of speech sharing the features of the verb and the noun
Article: a part of speech inflected for case and preposed or postposed to nouns (the relative pronoun is meant by the postposed article)
Pronoun: a part of speech substitutable for a noun and marked for person
Preposition: a part of speech placed before other words in composition and in syntax
Adverb: a part of speech without inflection, in modification of or in addition to a verb
Conjunction: a part of speech binding together the discourse and filling gaps in its interpretation
The Latin grammarian Priscian (fl. 500 CE) modified the above eight-fold system, substituting "interjection" for "article".
It wasn't until 1767 that the adjective was taken as a separate class.
Traditional English grammar is patterned after the European tradition above, and is still taught in schools and used in dictionaries.
It names eight parts of speech: noun, verb, adjective, adverb, pronoun, preposition, conjunction, and interjection (sometimes called an exclamation).
Controversies
Since the Greek grammarians of the 2nd century BCE, parts of speech have been defined by morphological, syntactic and semantic criteria.
However, there is currently no generally agreed-upon classification scheme that can apply to all languages, or even a set of criteria upon which such a scheme should be based.
Linguists recognize that the above list of eight word classes is simplified and artificial.
For example, "adverb" is to some extent a catch-all class that includes words with many different functions.
Some have even argued that the most basic of category distinctions, that of nouns and verbs, is unfounded, or not applicable to certain languages.
Functional classification
Common ways of delimiting words by function include:
Open word classes:
adjectives
adverbs
interjections
nouns
verbs (except auxiliary verbs)
Closed word classes:
auxiliary verbs
clitics
coverbs
conjunctions
determiners (articles, quantifiers, demonstrative adjectives, and possessive adjectives)
particles
measure words
adpositions (prepositions, postpositions, and circumpositions)
preverbs
pronouns
contractions
cardinal numbers
English
English frequently does not mark words as belonging to one part of speech or another.
Words like neigh, break, outlaw, laser, microwave and telephone might all be either verb forms or nouns.
Although -ly is an adverb marker, not all adverbs end in -ly and not all words ending in -ly are adverbs.
For instance, tomorrow, slow, fast, crosswise can all be adverbs, while early, friendly, ugly are all adjectives (though early can also function as an adverb).
In certain circumstances, even words with primarily grammatical functions can be used as verbs or nouns, as in "We must look to the hows and not just the whys" or "Miranda was to-ing and fro-ing and not paying attention".
Part-of-speech tagging
Part-of-speech tagging (POS tagging or POST), also called grammatical tagging, is the process of marking up the words in a text as corresponding to a particular part of speech, based on both their definition and their context, i.e., their relationship with adjacent and related words in a phrase, sentence, or paragraph.
A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc.
Once performed by hand, POS tagging is now done in the context of computational linguistics, using algorithms which associate discrete terms, as well as hidden parts of speech, in accordance with a set of descriptive tags.
History
Research on part-of-speech tagging has been closely tied to corpus linguistics.
The first major corpus of English for computer analysis was the Brown Corpus developed at Brown University by Henry Kucera and Nelson Francis, in the mid-1960s.
It consists of about 1,000,000 words of running English prose text, made up of 500 samples from randomly chosen publications.
Each sample is 2,000 or more words (ending at the first sentence-end after 2,000 words, so that the corpus contains only complete sentences).
The Brown Corpus was painstakingly "tagged" with part-of-speech markers over many years.
A first approximation was done with a program by Greene and Rubin, which consisted of a huge handmade list of what categories could co-occur at all.
For example, article then noun can occur, but article then verb (arguably) cannot.
The program got about 70% correct.
Its results were repeatedly reviewed and corrected by hand, and later users sent in errata, so that by the late 70s the tagging was nearly perfect (allowing for some cases even human speakers might not agree on).
This corpus has been used for innumerable studies of word-frequency and of part-of-speech, and inspired the development of similar "tagged" corpora in many other languages.
Statistics derived by analyzing it formed the basis for most later part-of-speech tagging systems, such as CLAWS and VOLSUNGA.
However, as of 2005 it has been superseded by larger corpora such as the 100-million-word British National Corpus.
For some time, part-of-speech tagging was considered an inseparable part of natural language processing, because there are certain cases where the correct part of speech cannot be decided without understanding the semantics or even the pragmatics of the context.
This is extremely expensive, especially because analyzing the higher levels is much harder when multiple part-of-speech possibilities must be considered for each word.
In the mid 1980s, researchers in Europe began to use hidden Markov models (HMMs) to disambiguate parts of speech, when working to tag the Lancaster-Oslo-Bergen Corpus of British English.
HMMs involve counting cases (such as from the Brown Corpus), and making a table of the probabilities of certain sequences.
For example, once you've seen an article such as 'the', perhaps the next word is a noun 40% of the time, an adjective 40%, and a number 20%.
Knowing this, a program can decide that "can" in "the can" is far more likely to be a noun than a verb or a modal.
The same method can of course be used to benefit from knowledge about following words.
More advanced ("higher order") HMMs learn the probabilities not only of pairs, but triples or even larger sequences.
So, for example, if you've just seen an article and a verb, the next item may very likely be a preposition, article, or noun, but is much less likely to be another verb.
When several ambiguous words occur together, the possibilities multiply.
However, it is easy to enumerate every combination and to assign a relative probability to each one, by multiplying together the probabilities of each choice in turn.
The combination with highest probability is then chosen.
The European group developed CLAWS, a tagging program that did exactly this, and achieved accuracy in the 93-95% range.
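The enumerate-and-multiply procedure can be sketched as follows. This is not the CLAWS implementation: the function name, the `<s>` start marker and the probability values in the usage note are invented for illustration, and only tag-pair probabilities are multiplied, whereas a full HMM would also include a per-word probability for each tag.

```python
from itertools import product

def best_tagging(words, candidates, transition, start="<s>"):
    """Try every combination of candidate tags and keep the most probable.

    candidates: word -> list of possible tags
    transition: (previous_tag, tag) -> probability of tag following previous_tag
    """
    best_tags, best_prob = None, 0.0
    for tags in product(*(candidates[word] for word in words)):
        prob = 1.0
        prev = start
        for tag in tags:
            # Unseen tag pairs are treated as impossible in this sketch.
            prob *= transition.get((prev, tag), 0.0)
            prev = tag
        if prob > best_prob:
            best_tags, best_prob = tags, prob
    return best_tags, best_prob
```

With hypothetical values such as `{("<s>", "ART"): 0.5, ("ART", "NOUN"): 0.4, ("ART", "VERB"): 0.01, ("ART", "MODAL"): 0.01}`, the phrase "the can" is tagged article then noun. The number of combinations tried is the product of the candidate counts per word, which is why long runs of ambiguous words become expensive.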
It is worth remembering, as Eugene Charniak points out in Statistical techniques for natural language parsing (www.cs.brown.edu/people/ec/home.html), that merely assigning the most common tag to each known word and the tag "proper noun" to all unknowns will approach 90% accuracy because many words are unambiguous.
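That baseline is simple enough to sketch in a few lines. The function names, the tag labels and the decision to lowercase words before lookup are assumptions of this example rather than details from Charniak's description.

```python
from collections import Counter, defaultdict

def train_baseline(tagged_words):
    """Record the most frequent tag for each word in a tagged corpus."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_baseline(words, model, unknown_tag="PROPER_NOUN"):
    """Tag known words with their most common tag, unknown words as proper nouns."""
    return [(word, model.get(word.lower(), unknown_tag)) for word in words]
```

No context is consulted at all, so any tagger that uses neighbouring words should be judged against this figure rather than against zero.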
CLAWS pioneered the field of HMM-based part of speech tagging, but was quite expensive since it enumerated all possibilities.
It sometimes had to resort to backup methods when there were simply too many (the Brown Corpus contains a case with 17 ambiguous words in a row, and there are words such as "still" that can represent as many as 7 distinct parts of speech).
In 1987, Steve DeRose and Ken Church independently developed dynamic programming algorithms to solve the same problem in vastly less time.
Their methods were similar to the Viterbi algorithm known for some time in other fields.
DeRose used a table of pairs, while Church used a table of triples and an ingenious method of estimating the values for triples that were rare or nonexistent in the Brown Corpus (actual measurement of triple probabilities would require a much larger corpus).
Both methods achieved accuracy over 95%.
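The dynamic programming idea can be illustrated with a minimal Viterbi-style sketch over a table of tag pairs. It is not DeRose's or Church's actual program: the function name, the `<s>` start marker and the probabilities in the usage note are assumptions, and per-word tag probabilities are omitted for brevity.

```python
def viterbi(words, candidates, transition, start="<s>"):
    """Return (tags, probability) for the most probable tag sequence.

    candidates: word -> list of possible tags
    transition: (previous_tag, tag) -> probability of tag following previous_tag
    """
    # best[tag] = (probability, path) of the best sequence so far ending in tag
    best = {start: (1.0, [])}
    for word in words:
        new_best = {}
        for tag in candidates[word]:
            # Keep only the best way of reaching this tag; all other
            # histories ending in this tag can never win later.
            prob, path = max(
                ((p * transition.get((prev, tag), 0.0), prev_path)
                 for prev, (p, prev_path) in best.items()),
                key=lambda item: item[0],
            )
            new_best[tag] = (prob, path + [tag])
        best = new_best
    prob, path = max(best.values(), key=lambda item: item[0])
    return path, prob
```

The work per word is proportional to the square of the number of candidate tags, so total time grows linearly with sentence length instead of multiplying with every ambiguous word.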
DeRose's 1990 dissertation at Brown University included analyses of the specific error types, probabilities, and other related data, and replicated his work for Greek, where it proved similarly effective.
These findings were surprisingly disruptive to the field of Natural Language Processing.
The accuracy reported was higher than the typical accuracy of very sophisticated algorithms that integrated part of speech choice with many higher levels of linguistic analysis: syntax, morphology, semantics, and so on.
CLAWS, DeRose's and Church's methods did fail for some of the known cases where semantics is required, but those proved negligibly rare.
This convinced many in the field that part-of-speech tagging could usefully be separated out from the other levels of processing; this in turn simplified the theory and practice of computerized language analysis, and encouraged researchers to find ways to separate out other pieces as well.
Markov models are now the standard method for part-of-speech assignment.
The methods already discussed involve working from a pre-existing corpus to learn tag probabilities.
It is, however, also possible to bootstrap using "unsupervised" tagging.
Unsupervised tagging techniques use an untagged corpus for their training data and produce the tagset by induction.
That is, they observe patterns in word use, and derive part-of-speech categories themselves.
For example, statistics readily reveal that "the", "a", and "an" occur in similar contexts, while "eat" occurs in very different ones.
With sufficient iteration, similarity classes of words emerge that are remarkably similar to those human linguists would expect; and the differences themselves sometimes suggest valuable new insights.
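One simple way to see how such similarity classes can emerge is to compare the contexts in which words occur. The sketch below builds a count vector of left and right neighbours for each word and compares two words by cosine similarity; the feature scheme, the function names and the tiny corpus in the usage note are assumptions for illustration, and real unsupervised taggers use far larger corpora and an explicit clustering step.

```python
import math
from collections import Counter, defaultdict

def context_vectors(sentences):
    """Count each word's immediate left (L:) and right (R:) neighbours."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        padded = ["<s>"] + list(sentence) + ["</s>"]
        for i in range(1, len(padded) - 1):
            vectors[padded[i]]["L:" + padded[i - 1]] += 1
            vectors[padded[i]]["R:" + padded[i + 1]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[key] * v[key] for key in u if key in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0
```

On a toy corpus such as "the dog eats", "a dog eats", "the cat eats", "a cat sleeps", the vectors for "the" and "a" are far more similar to each other than either is to the vector for "eats".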
Both supervised and unsupervised tagging can be further subdivided into rule-based, stochastic, and neural approaches.
Some current major algorithms for part-of-speech tagging include the Viterbi algorithm, the Brill tagger, and the Baum-Welch algorithm (also known as the forward-backward algorithm).
Hidden Markov model and visible Markov model taggers can both be implemented using the Viterbi algorithm.
Pattern recognition
Pattern recognition is a sub-topic of machine learning.
It can be defined as
"the act of taking in raw data and taking an action based on the category of the data".
Most research in pattern recognition is about methods for supervised learning and unsupervised learning.
Pattern recognition aims to classify data (patterns) based on either a priori knowledge or on statistical information extracted from the patterns.
The patterns to be classified are usually groups of measurements or observations, defining points in an appropriate multidimensional space.
This is in contrast to pattern matching, where the pattern is rigidly specified.
Overview
A complete pattern recognition system consists of a sensor that gathers the observations to be classified or described; a feature extraction mechanism that computes numeric or symbolic information from the observations; and a classification or description scheme that does the actual job of classifying or describing observations, relying on the extracted features.
The classification or description scheme is usually based on the availability of a set of patterns that have already been classified or described.
This set of patterns is termed the training set and the resulting learning strategy is characterized as supervised learning.
Learning can also be unsupervised, in the sense that the system is not given an a priori labeling of patterns; instead it establishes the classes itself based on the statistical regularities of the patterns.
The classification or description scheme usually uses one of the following approaches: statistical (or decision theoretic) or syntactic (or structural).
Statistical pattern recognition is based on statistical characterisations of patterns, assuming that the patterns are generated by a probabilistic system.
Syntactical (or structural) pattern recognition is based on the structural interrelationships of features.
A wide range of algorithms can be applied for pattern recognition, from very simple Bayesian classifiers to much more powerful neural networks.
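As an illustration of the simple end of that range, the following is a minimal sketch of a Gaussian naive Bayes classifier trained on a labelled set of numeric feature vectors. The function names, the independence and normality assumptions, and the variance floor are choices made for this example, not a description of any particular system.

```python
import math
from collections import defaultdict

def train(samples):
    """Estimate a prior plus per-feature mean and variance for each class.

    samples: list of (feature_vector, label) pairs, the training set.
    """
    by_label = defaultdict(list)
    for features, label in samples:
        by_label[label].append(features)
    total = len(samples)
    model = {}
    for label, rows in by_label.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(column) / n for column in columns]
        # A small floor keeps the variance positive for constant features.
        variances = [
            max(sum((x - mean) ** 2 for x in column) / n, 1e-9)
            for column, mean in zip(columns, means)
        ]
        model[label] = (n / total, means, variances)
    return model

def classify(model, features):
    """Return the class with the highest posterior probability."""
    def log_posterior(label):
        prior, means, variances = model[label]
        score = math.log(prior)
        for x, mean, var in zip(features, means, variances):
            # Log of the normal density, one independent term per feature.
            score += -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)
        return score
    return max(model, key=log_posterior)
```

The feature vectors play the role of the points in a multidimensional space described earlier, and the labelled samples form the training set of a supervised learner.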
An intriguing problem in pattern recognition yet to be solved is the relationship between the problem to be solved (data to be classified) and the performance of various pattern recognition algorithms (classifiers).
Pattern recognition is more complex when templates are used to generate variants.
For example, in English, sentences often follow the "N-VP" (noun - verb phrase) pattern, but some knowledge of the English language is required to detect the pattern.
Pattern recognition is studied in many fields, including psychology, ethology, and computer science.
Holographic associative memory is another type of pattern matching scheme, in which a small target pattern can be searched for in a large set of learned patterns based on cognitive meta-weight.
Uses
Within medical science, pattern recognition forms the basis for computer-aided diagnosis (CAD) systems.
CAD describes a procedure that supports the doctor's interpretations and findings.
Typical applications are automatic speech recognition, classification of text into several categories (e.g. spam/non-spam email messages), the automatic recognition of handwritten postal codes on postal envelopes, or the automatic recognition of images of human faces.
The last two examples form the subtopic image analysis of pattern recognition that deals with digital images as input to pattern recognition systems.
Phrase
In grammar, a phrase is a group of words that functions as a single unit in the syntax of a sentence.
For example the house at the end of the street (example 1) is a phrase.
It acts like a noun.
It contains the phrase at the end of the street (example 2), a prepositional phrase which acts like an adjective.
Example 2 could be replaced by white, to make the phrase the white house.
Examples 1 and 2 contain the phrase the end of the street (example 3) which acts like a noun.
It could be replaced by the cross-roads to give the house at the cross-roads.
Most phrases have a central word which defines the type of phrase.
This word is called the head of the phrase.
In English the head is often the first word of the phrase.
Some phrases, however, can be headless.
For example, the rich is a noun phrase composed of a determiner and an adjective, but no noun.
Phrases may be classified by the type of head they take:
Prepositional phrase (PP) with a preposition as head (e.g. in love, over the rainbow).
Languages that use postpositions instead have postpositional phrases.
The two types are sometimes collectively referred to as adpositional phrases.
Noun phrase (NP) with a noun as head (e.g. the black cat, a cat on the mat)
Verb phrase (VP) with a verb as head (e.g. eat cheese, jump up and down)
Adjectival phrase with an adjective as head (e.g. full of toys)
Adverbial phrase with adverb as head (e.g. very carefully)
Formal definition
A phrase is a syntactic structure which has syntactic properties derived from its head.
Complexity
A complex phrase consists of several words, whereas a simple phrase consists of only one word.
This terminology is especially often used with verb phrases:
simple past and simple present are simple verbs, which require just one verb
complex verbs have one or two aspects added, and hence require two or three words
"Complex", which is phrase-level, is often confused with "compound", which is word-level.
However, there are certain phenomena that formally seem to be phrases but semantically are more like compounds, like "women's magazines", which has the form of a possessive noun phrase, but which refers (just like a compound) to one specific lexeme (i.e. a magazine for women and not some magazine owned by a woman).
Semiotic approaches to the concept of "phrase"
In more semiotic approaches to language, such as the more cognitivist versions of construction grammar, a phrasal structure is not only a certain formal combination of word types whose features are inherited from the head.
Here each phrasal structure also expresses some type of conceptual content, be it specific or abstract.
Portuguese language
Portuguese (língua portuguesa) is a Romance language that originated in what is now Galicia (Spain) and northern Portugal from the Latin spoken by romanized Pre-Roman peoples of the Iberian Peninsula (namely the Gallaeci, the Lusitanians, the Celtici and the Conii) about 2000 years ago.
It spread worldwide in the 15th and 16th centuries as Portugal established a colonial and commercial empire (1415–1999) which spanned from Brazil in the Americas to Goa in India and Macau in China. On the island of Sri Lanka it was used exclusively as the lingua franca for almost 350 years.
During that time, many creole languages based on Portuguese also appeared around the world, especially in Africa, Asia, and the Caribbean.
Today it is one of the world's major languages, ranked 6th according to number of native speakers (approximately 177 million).
It is the language with the largest number of speakers in South America, spoken by nearly all of Brazil's population, which amounts to over 51% of the continent's population, even though Brazil is the only Portuguese-speaking nation in the Americas.
It is also a major lingua franca in Portugal's former colonial possessions in Africa.
It is the official language of ten countries (see the table on the right), also being co-official with Spanish and French in Equatorial Guinea, with Cantonese Chinese in the Chinese special administrative region of Macau, and with Tetum in East Timor.
There are sizable communities of Portuguese-speakers in various regions of North America, notably in the United States (New Jersey, New England and south Florida) and in Ontario, Canada.
Spanish author Miguel de Cervantes once called Portuguese "the sweet language", while Brazilian writer Olavo Bilac poetically described it as a última flor do Lácio, inculta e bela: "the last flower of Latium, wild and beautiful".
Geographic distribution
Today, Portuguese is the official language of Angola, Brazil, Cape Verde, Guinea-Bissau, Portugal, São Tomé and Príncipe and Mozambique.
It is also one of the official languages of Equatorial Guinea (with Spanish and French), the Chinese special administrative region of Macau (with Chinese), and East Timor (with Tetum).
It is a native language of most of the population in Portugal (100%), Brazil (99%), Angola (60%), and São Tomé and Príncipe (50%), and it is spoken by a plurality of the population of Mozambique (40%), though only 6.5% are native speakers.
No data is available for Cape Verde, but almost all the population is bilingual, and the monolingual population speaks Cape Verdean Creole.
Small Portuguese-speaking communities subsist in former overseas colonies of Portugal such as Macau, where it is spoken as a first language by 0.6% of the population, and East Timor.
Uruguay gave Portuguese equal status with Spanish in its educational system along its northern border with Brazil.
In the rest of the country, it is taught as a compulsory subject beginning in the 6th grade.
It is also spoken by substantial immigrant communities, though not official, in Andorra, France, Luxembourg, Jersey (with a statistically significant Portuguese-speaking community of approximately 10,000 people), Paraguay, Namibia, South Africa, Switzerland, Venezuela and in the U.S. states of California, Connecticut, Florida, Massachusetts, New Jersey, New York and Rhode Island.
In some parts of India, such as Goa and Daman and Diu, Portuguese is still spoken.
There are also significant populations of Portuguese speakers in Canada (mainly concentrated in and around Toronto), Bermuda and the Netherlands Antilles.
Portuguese is an official language of several international organizations.
The Community of Portuguese Language Countries (with the Portuguese acronym CPLP) consists of the eight independent countries that have Portuguese as an official language.
It is also an official language of the European Union, Mercosul, the Organization of American States, the Organization of Ibero-American States, the Union of South American Nations, and the African Union (one of the working languages) and one of the official languages of other organizations.
The Portuguese language is gaining popularity in Africa, Asia, and South America as a second language for study.
Portuguese and Spanish are the fastest-growing European languages, and, according to estimates by UNESCO, Portuguese is the language with the highest potential for growth as an international language in southern Africa and South America.
The Portuguese-speaking African countries are expected to have a combined population of 83 million by 2050.
Since 1991, when Brazil joined the Mercosul economic bloc with other South American nations, such as Argentina, Uruguay, and Paraguay, there has been an increase in interest in the study of Portuguese in those South American countries.
The demographic weight of Brazil in the continent will continue to strengthen the presence of the language in the region.
Although the use of Portuguese was in decline in Asia in the early 21st century, after Macau was ceded to China in 1999, it is becoming a language of opportunity there, mostly because of East Timor's boost in the number of speakers in the last five years but also because of increased Chinese diplomatic and financial ties with Portuguese-speaking countries.
In July 2007, President Teodoro Obiang Nguema announced his government's decision to make Portuguese Equatorial Guinea's third official language, in order to meet the requirements to apply for full membership of the Community of Portuguese Language Countries.
This upgrading from its current Associate Observer condition would result in Equatorial Guinea being able to access several professional and academic exchange programs and the facilitation of cross-border circulation of citizens.
Its application is currently being assessed by other CPLP members.
In March 1994 the Bosque de Portugal (Portugal's Woods) was founded in the Brazilian city of Curitiba.
The park houses the Portuguese Language Memorial, which honors the Portuguese immigrants and the countries that adopted the Portuguese language.
Originally there were seven nations represented with pillars, but the independence of East Timor brought yet another pillar for that nation in 2007.
In March 2006, the Museum of the Portuguese Language, an interactive museum about the Portuguese language, was founded in São Paulo, Brazil, the city with the largest number of Portuguese speakers in the world.
Dialects
Portuguese is a pluricentric language with two main groups of dialects, those of Brazil and those of the Old World.
For historical reasons, the dialects of Africa and Asia are generally closer to those of Portugal than the Brazilian dialects, although in some aspects of their phonetics, especially the pronunciation of unstressed vowels, they resemble Brazilian Portuguese more than European Portuguese.
They have not been studied as widely as European and Brazilian Portuguese.
Audio samples of some dialects of Portuguese are available below.
There are some differences between the areas but these are the best approximations possible.
For example, the caipira dialect has some differences from the one of Minas Gerais, but in general it is very close.
A good example of Brazilian Portuguese may be found in the capital city, Brasília, because of the generalized population from all parts of the country.
Angola
Benguelense — Benguela province.
Luandense — Luanda province.
Sulista — South of Angola.
Brazil
Caipira — States of São Paulo (countryside; the city of São Paulo and the eastern areas of the state have their own dialect, called paulistano); southern Minas Gerais, northern Paraná, Goiás and Mato Grosso do Sul.
Cearense — Ceará.
Baiano — Bahia.
Fluminense — Variants spoken in the states of Rio de Janeiro and Espírito Santo (excluding the city of Rio de Janeiro and its adjacent metropolitan areas, which have their own dialect, called carioca).
Gaúcho — Rio Grande do Sul.
(There are many distinct accents in Rio Grande do Sul, mainly due to the heavy influx of European immigrants of diverse origins, who settled several colonies throughout the state.)
Mineiro — Minas Gerais (not prevalent in the Triângulo Mineiro, southern and southeastern Minas Gerais).
Nordestino — northeastern states of Brazil (Pernambuco and Rio Grande do Norte have a particular way of speaking).
Nortista — Amazon Basin states.
Paulistano — Variants spoken around São Paulo city and the eastern areas of São Paulo state.
Sertanejo — States of Goiás and Mato Grosso (the city of Cuiabá has a particular way of speaking).
Sulista — Variants spoken in the areas between the northern regions of Rio Grande do Sul and southern regions of São Paulo state.
(The cities of Curitiba, Florianópolis, and Itapetininga have fairly distinct accents as well.)
Portugal
Açoriano (Azorean) — Azores.
Alentejano — Alentejo
Algarvio — Algarve (there is a particular dialect in a small part of western Algarve).
Alto-Minhoto — North of Braga (hinterland).
Baixo-Beirão; Alto-Alentejano — Central Portugal (hinterland).
Beirão — Central Portugal.
Estremenho — Regions of Coimbra and Lisbon (the Lisbon dialect has some peculiar features not shared with the one of Coimbra).
Madeirense (Madeiran) — Madeira.
Nortenho — Regions of Braga and Porto.
Transmontano — Trás-os-Montes e Alto Douro.
Other countries
Cape Verde — Português cabo-verdiano (Cape Verdean Portuguese)
Daman and Diu, India — Damaense.
East Timor — Timorense (East Timorese)
Goa, India — Goês.
Guinea-Bissau — Guineense (Guinean Portuguese).
Macau, China — Macaense (Macanese)
Mozambique — Moçambicano (Mozambican)
São Tomé and Príncipe — Santomense
Uruguay — Dialectos Portugueses del Uruguay (DPU).
Differences between dialects are mostly of accent and vocabulary, but between the Brazilian dialects and other dialects, especially in their most colloquial forms, there can also be some grammatical differences.
The Portuguese-based creoles spoken in various parts of Africa, Asia, and the Americas are independent languages which should not be confused with Portuguese itself.
History
Arriving in the Iberian Peninsula in 216 BC, the Romans brought with them the Latin language, from which all Romance languages descend.
The language was spread by arriving Roman soldiers, settlers and merchants, who built Roman cities mostly near the settlements of previous civilizations.
Between AD 409 and 711, as the Roman Empire collapsed in Western Europe, the Iberian Peninsula was conquered by Germanic peoples (Migration Period).
The occupiers, mainly Suebi and Visigoths, quickly adopted late Roman culture and the Vulgar Latin dialects of the peninsula.
After the Moorish invasion of 711, Arabic became the administrative language in the conquered regions, but most of the population continued to speak a form of Romance commonly known as Mozarabic.
The influence exerted by Arabic on the Romance dialects spoken in the Christian kingdoms of the north was small, affecting mainly their lexicon.
The earliest surviving records of a distinctively Portuguese language are administrative documents of the 9th century, still interspersed with many Latin phrases.
Today this phase is known as Proto-Portuguese (between the 9th and the 12th centuries).
In the first period of Old Portuguese — Galician-Portuguese Period (from the 12th to the 14th century) — the language gradually came into general use.
For some time, it was the language of preference for lyric poetry in Christian Hispania, much like Occitan was the language of the poetry of the troubadours.
Portugal was formally recognized as an independent kingdom by the Kingdom of Leon in 1143, with Afonso Henriques as king.
In 1290, King Dinis created the first Portuguese university in Lisbon (the Estudos Gerais, later moved to Coimbra) and decreed that Portuguese, then simply called the "common language", should be known as the Portuguese language and used officially.
In the second period of Old Portuguese, from the 14th to the 16th century, with the Portuguese discoveries, the language was taken to many regions of Asia, Africa and the Americas (nowadays, the great majority of Portuguese speakers live in Brazil, in South America).
By the 16th century it had become a lingua franca in Asia and Africa, used not only for colonial administration and trade but also for communication between local officials and Europeans of all nationalities.
Its spread was helped by mixed marriages between Portuguese and local people, and by its association with Roman Catholic missionary efforts, which led to the formation of a creole language called Kristang in many parts of Asia (from the word cristão, "Christian").
The language continued to be popular in parts of Asia until the 19th century.
Some Portuguese-speaking Christian communities in India, Sri Lanka, Malaysia, and Indonesia preserved their language even after they were isolated from Portugal.
The end of the Old Portuguese period was marked by the publication of the Cancioneiro Geral by Garcia de Resende, in 1516.
The early times of Modern Portuguese, which spans from the 16th century to present day, were characterized by an increase in the number of learned words borrowed from Classical Latin and Classical Greek since the Renaissance, which greatly enriched the lexicon.
Characterization
A distinctive feature of Portuguese is that it preserved the stressed vowels of Vulgar Latin, which became diphthongs in other Romance languages; cf. Fr. pierre, Sp. piedra, It. pietra, Port. pedra, from Lat. petra; or Sp. fuego, It. fuoco, Port. fogo, from Lat. focum.
Another characteristic of early Portuguese was the loss of intervocalic l and n, sometimes followed by the merger of the two surrounding vowels, or by the insertion of an epenthetic vowel between them: cf. Lat. salire, tenere, catena, Sp. salir, tener, cadena, Port. sair, ter, cadeia.
When the elided consonant was n, it often nasalized the preceding vowel: cf. Lat. manum, rana, bonum, Port. mão, rãa, bõo (now mão, rã, bom).
This process was the source of most of the nasal diphthongs which are typical of Portuguese.
In particular, the Latin endings -anem, -anum and -onem became -ão in most cases, cf. Lat. canem, germanum, rationem with Modern Port. cão, irmão, razão, and their plurals -anes, -anos, -ones normally became -ães, -ãos, -ões, cf. cães, irmãos, razões.
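The correspondence between the Latin endings and the three modern plurals of -ão can be sketched as a small lookup. This is an illustrative sketch only: the function and table names are assumptions, the rule covers just the three endings named above, and the text itself notes that the outcome holds "in most cases" rather than always.

```python
import unicodedata

# Latin accusative singular ending -> (Portuguese singular ending, plural ending),
# following the correspondences described in the text. Names are illustrative.
ENDINGS = {
    "anem": ("ão", "ães"),
    "anum": ("ão", "ãos"),
    "onem": ("ão", "ões"),
}


def portuguese_plural(singular, latin_accusative):
    """Pick the plural of a Portuguese word in -ão from its Latin etymon."""
    # Normalize so that precomposed and decomposed accents compare equal.
    singular = unicodedata.normalize("NFC", singular)
    for latin_ending, (sg, pl) in ENDINGS.items():
        sg = unicodedata.normalize("NFC", sg)
        pl = unicodedata.normalize("NFC", pl)
        if latin_accusative.endswith(latin_ending) and singular.endswith(sg):
            return singular[: -len(sg)] + pl
    raise ValueError(f"no rule for {singular!r} from {latin_accusative!r}")


for word, etymon in [("cão", "canem"), ("irmão", "germanum"), ("razão", "rationem")]:
    print(word, "->", portuguese_plural(word, etymon))
```

The sketch shows why the plural cannot be predicted from the modern singular alone: all three words end in -ão, and only the Latin source distinguishes them.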
Movement to make Portuguese an official language of the UN
There is a growing number of people in the Portuguese-speaking media and on the internet who are presenting the case to the CPLP and other organizations to run a debate in the Lusophone community with the purpose of bringing forward a petition to make Portuguese an official language of the United Nations.
In October 2005, during the international Convention of the Elos Club International that took place in Tavira, Portugal, a petition was written and unanimously approved; its text can be found on the internet under the title Petição Para Tornar Oficial o Idioma Português na ONU.
Romulo Alexandre Soares, president of the Brazil-Portugal Chamber, highlights that the positioning of Brazil in the international arena as one of the emergent powers of the 21st century, the size of its population, and the presence of the language around the world provide legitimacy and justify a petition to the UN to make Portuguese an official language at the UN.
Vocabulary
Most of the lexicon of Portuguese is derived from Latin.
Nevertheless, because of the Moorish occupation of the Iberian Peninsula during the Middle Ages, and the participation of Portugal in the Age of Discovery, it has adopted loanwords from all over the world.
Very few Portuguese words can be traced to the pre-Roman inhabitants of Portugal, which included the Gallaeci, Lusitanians, Celtici and Cynetes.
The Phoenicians and Carthaginians, briefly present, also left some scarce traces.
Some notable examples are abóbora "pumpkin" and bezerro "year-old calf", from the nearby Celtiberian language (probably through the Celtici); cerveja "beer", from Celtic; saco "bag", from Phoenician; and cachorro "dog, puppy", from Basque.
In the 5th century, the Iberian Peninsula (the Roman Hispania) was conquered by the Germanic Suevi and Visigoths.
As they adopted the Roman civilization and language, however, these people contributed only a few words to the lexicon, mostly related to warfare — such as espora "spur", estaca "stake", and guerra "war", from Gothic *spaúra, *stakka, and *wirro, respectively.
Between the 9th and 15th centuries Portuguese acquired about 1000 words from Arabic by influence of Moorish Iberia.
They are often recognizable by the initial Arabic article a(l)-, and include many common words such as aldeia "village" from الضيعة aldaya, alface "lettuce" from الخس alkhass, armazém "warehouse" from المخزن almahazan, and azeite "olive oil" from زيت azzait.
From Arabic came also the grammatically peculiar word oxalá "hopefully".
The Mozambican currency name metical was derived from the word مطقال miṭqāl, a unit of weight.
The word Mozambique itself is from the Arabic name of sultan Muça Alebique (Musa Alibiki).
The name of the Portuguese town of Fátima comes from the name of one of the daughters of the prophet Muhammad.
Starting in the 15th century, the Portuguese maritime explorations led to the introduction of many loanwords from Asian languages.
For instance, catana "cutlass" from Japanese katana; corja "rabble" from Malay kórchchu; and chá "tea" from Chinese chá.
From South America came batata "potato", from Taino; ananás and abacaxi, from Tupi-Guarani naná and Tupi ibá cati, respectively (two species of pineapple), and tucano "toucan" from Guarani tucan.
See List of Brazil state name etymologies, for some more examples.
From the 16th to the 19th century, because of the role of Portugal as intermediary in the Atlantic slave trade and the establishment of large Portuguese colonies in Angola, Mozambique, and Brazil, Portuguese acquired several words of African and Amerind origin, especially names for most of the animals and plants found in those territories.
While those terms are mostly used in the former colonies, many became current in European Portuguese as well.
From Kimbundu, for example, came kifumate → cafuné "head caress", kusula → caçula "youngest child", marimbondo "tropical wasp", and kubungula → bungular "to dance like a wizard".
Finally, it has received a steady influx of loanwords from other European languages.
For example, melena "hair lock", fiambre "wet-cured ham" (in contrast with presunto "dry-cured ham" from Latin prae-exsuctus "dehydrated"), and castelhano "Castilian", from Spanish; colchete/crochê "bracket"/"crochet", paletó "jacket", batom "lipstick", and filé/filete "steak"/"slice" respectively, from French crochet, paletot, bâton, filet; macarrão "pasta", piloto "pilot", carroça "carriage", and barraca "barrack", from Italian maccherone, pilota, carrozza, baracca; and bife "steak", futebol, revólver, estoque, folclore, from English beef, football, revolver, stock, folklore.
Classification and related languages
Portuguese belongs to the West Iberian branch of the Romance languages, and it has special ties with the following members of this group:
Galician and the Fala, its closest relatives.
See below.
Spanish, the major language closest to Portuguese.
(See also Differences between Spanish and Portuguese.)
Mirandese, another West Iberian language spoken in Portugal.
Judeo-Portuguese and Judeo-Spanish, languages spoken by Sephardic Jews, which remained close to Portuguese and Spanish.
Despite the obvious lexical and grammatical similarities between Portuguese and other Romance languages, it is not mutually intelligible with most of them.
Apart from Galician, Portuguese speakers will usually need some formal study of basic grammar and vocabulary, before attaining a reasonable level of comprehension of those languages, and vice-versa.
Galician and the Fala
The closest language to Portuguese is Galician, spoken in the autonomous community of Galicia (northwestern Spain).
The two were at one time a single language, known today as Galician-Portuguese, but since the political separation of Portugal from Galicia they have diverged somewhat, especially in pronunciation and vocabulary.
Nevertheless, the core vocabulary and grammar of Galician are still noticeably closer to Portuguese than to Spanish.
In particular, like Portuguese, it uses the future subjunctive, the personal infinitive, and the synthetic pluperfect (see the section on the grammar of Portuguese, below).
Mutual intelligibility (estimated at 85% by R. A. Hall, Jr., 1989) is good between Galicians and northern Portuguese, but poorer between Galicians and speakers from central Portugal.
The Fala language is another descendant of Galician-Portuguese, spoken by a small number of people in the Spanish towns of Valverdi du Fresnu, As Ellas and Sa Martín de Trebellu (autonomous community of Extremadura, near the border with Portugal).
Influence on other languages
Many languages have borrowed words from Portuguese, such as Indonesian, Sri Lankan Tamil and Sinhalese (see Sri Lanka Indo-Portuguese), Malay, Bengali, English, Hindi, Konkani, Marathi, Tetum, Xitsonga, Papiamentu, Japanese, Bajan Creole (spoken in Barbados), Lanc-Patuá (spoken in northern Brazil) and Sranan Tongo (spoken in Suriname).
It left a strong influence on the língua brasílica, a Tupi-Guarani language which was the most widely spoken in Brazil until the 18th century, and on the language spoken around Sikka in Flores Island, Indonesia.
In nearby Larantuka, Portuguese is used for prayers in Holy Week rituals.
The Japanese-Portuguese dictionary Nippo Jisho (1603) was the first dictionary of Japanese in a European language, a product of Jesuit missionary activity in Japan.
Building on the work of earlier Portuguese missionaries, the Dictionarium Anamiticum, Lusitanum et Latinum (Annamite-Portuguese-Latin dictionary) of Alexandre de Rhodes (1651) introduced the modern orthography of Vietnamese, which is based on the orthography of 17th-century Portuguese.
The Romanization of Chinese was also influenced by the Portuguese language (among others), particularly regarding Chinese surnames; one example is Mei.
See also List of English words of Portuguese origin, Loan words in Indonesian, Japanese words of Portuguese origin, Borrowed words in Malay, Sinhala words of Portuguese origin, Loan words from Portuguese in Sri Lankan Tamil.
Derived languages
Beginning in the 16th century, the extensive contacts between Portuguese travelers and settlers, African slaves, and local populations led to the appearance of many pidgins with varying amounts of Portuguese influence.
As these pidgins became the mother tongue of succeeding generations, they evolved into fully fledged creole languages, which remained in use in many parts of Asia and Africa until the 18th century.
Some Portuguese-based or Portuguese-influenced creoles are still spoken today, by over 3 million people worldwide, especially people of partial Portuguese ancestry.
Phonology
There is a maximum of 9 oral vowels and 19 consonants, though some varieties of the language have fewer phonemes (Brazilian Portuguese has only 8 oral vowel phones).
There are also five nasal vowels, which some linguists regard as allophones of the oral vowels, ten oral diphthongs, and five nasal diphthongs.
Vowels
To the seven vowels of Vulgar Latin, European Portuguese has added two near-central vowels, one of which tends to be elided in rapid speech, like the e caduc of French.
The high vowels and the low vowels are four distinct phonemes, and they alternate in various forms of apophony.
Like Catalan, Portuguese uses vowel quality to contrast stressed syllables with unstressed syllables: isolated vowels tend to be raised, and in some cases centralized, when unstressed.
Nasal diphthongs occur mostly at the end of words.
Consonants
The consonant inventory of Portuguese is fairly conservative.
The medieval affricates merged with the corresponding fricatives, but not with each other, and there have been no other significant changes to the consonant phonemes since then.
However, some remarkable dialectal variants and allophones have appeared, among which:
In many regions of Brazil, /t/ and /d/ have the affricate allophones [tʃ] and [dʒ], respectively, before /i/ and /ĩ/.
(Quebec French has a similar phenomenon, with alveolar affricates instead of postalveolars.
Japanese is another example).
At the end of a syllable, the phoneme /l/ has the allophone [w] in Brazilian Portuguese (L-vocalization).
In many parts of Brazil and Angola, the intervocalic palatal nasal /ɲ/ is pronounced as a nasalized palatal approximant which nasalizes the preceding vowel.
In most of Brazil, the alveolar sibilants /s/ and /z/ occur in complementary distribution at the end of syllables, depending on whether the consonant that follows is voiceless or voiced, as in English.
But in most of Portugal and parts of Brazil sibilants are postalveolar at the end of syllables, [ʃ] before voiceless consonants and [ʒ] before voiced consonants (in Judeo-Spanish, /s/ is often replaced with /ʃ/ at the end of syllables, too).
There is considerable dialectal variation in the value of the rhotic phoneme.
See Guttural R in Portuguese, for details.
Grammar
A particularly interesting aspect of the grammar of Portuguese is the verb.
Morphologically, more verbal inflections from classical Latin have been preserved by Portuguese than by any other major Romance language.
See Romance copula, for a detailed comparison.
It also has some innovations not found in other Romance languages (except Galician and the Fala):
The present perfect tense has an iterative sense unique among the Romance languages.
It denotes an action or a series of actions which began in the past and are expected to keep repeating in the future.
For instance, the sentence Tenho tentado falar com ela would be translated to "I have been trying to talk to her", not "I have tried to talk to her".
On the other hand, the correct translation of the question "Have you heard the latest news?" is not *Tem ouvido a última notícia?, but Ouviu a última notícia?, since no repetition is implied.
The future subjunctive tense, which was developed by medieval West Iberian Romance, but has now fallen into disuse in Spanish, is still used in vernacular Portuguese.
It appears in dependent clauses that denote a condition which must be fulfilled in the future, so that the independent clause will occur.
Other languages normally employ the present tense under the same circumstances:
Se for eleito presidente, mudarei a lei.
If I am elected president, I will change the law.
Quando fores mais velho, vais entender.
When you are older, you will understand.
The personal infinitive: infinitives can inflect according to their subject in person and number, often showing who is expected to perform a certain action; cf. É melhor voltares "It is better [for you] to go back," É melhor voltarmos "It is better [for us] to go back."
Perhaps for this reason, infinitive clauses replace subjunctive clauses more often in Portuguese than in other Romance languages.
Writing system
Portuguese is written with the Latin alphabet, making use of five diacritics to denote stress, vowel height, contraction, nasalization, and other sound changes (acute accent, grave accent, circumflex accent, tilde, and cedilla).
Brazilian Portuguese also uses the diaeresis mark.
Accented characters and digraphs are not counted as separate letters for collation purposes.
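The collation rule can be approximated in code by sorting on each word's base letters, so that accented characters are filed with their unaccented counterparts. The following is a minimal sketch using Python's standard unicodedata module; the function name and word list are illustrative, and production software would normally rely on a locale-aware collation library instead.

```python
import unicodedata


def collation_key(word):
    """Sort key that ignores diacritics, so 'á' files with 'a' and 'ç' with 'c'."""
    decomposed = unicodedata.normalize("NFD", word.casefold())
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # The original word is a tie-breaker for words that differ only in accents.
    return (base, word)


words = ["órgão", "ovo", "água", "azul", "ação", "abacaxi"]

print(sorted(words))                      # plain code-point order
print(sorted(words, key=collation_key))   # accent-insensitive order
```

Sorting by raw code points places every word beginning with an accented letter after all unaccented words, which is the behaviour the rule above excludes.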
Brazilian vs. European spelling
There are some minor differences between the orthographies of Brazil and other Portuguese language countries.
One of the most pervasive is the use of acute accents in the European/African/Asian orthography in many words such as sinónimo, where the Brazilian orthography has a circumflex accent, sinônimo.
Another important difference is that Brazilian spelling often lacks c or p before c, ç, or t, where the European orthography has them; for example, cf. Brazilian fato with European facto, "fact", or Brazilian objeto with European objecto, "object".
Some of these spelling differences reflect differences in the pronunciation of the words, but others are merely graphic.
Examples
Excerpt from the Portuguese national epic Os Lusíadas, by author Luís de Camões (I, 33)
Predictive analytics
Predictive analytics encompasses a variety of techniques from statistics and data mining that analyze current and historical data to make predictions about future events.
Such predictions rarely take the form of absolute statements, and are more likely to be expressed as values that correspond to the odds of a particular event or behavior taking place in the future.
In business, predictive models exploit patterns found in historical and transactional data to identify risks and opportunities.
Models capture relationships among many factors to allow assessment of risk or potential associated with a particular set of conditions, guiding decision making for candidate transactions.
One of the most well-known applications is credit scoring, which is used throughout financial services.
Scoring models process a customer’s credit history, loan application, customer data, etc., in order to rank-order individuals by their likelihood of making future credit payments on time.
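As a rough illustration of rank-ordering, the sketch below scores three hypothetical applicants with a toy logistic model and sorts them by estimated likelihood of paying on time. The applicant records, the feature names and the weights are all invented for the example and do not correspond to any real scoring model.

```python
import math

# Hypothetical applicants with made-up features; not real data.
applicants = [
    {"name": "A", "late_payments": 0, "utilization": 0.20},
    {"name": "B", "late_payments": 3, "utilization": 0.90},
    {"name": "C", "late_payments": 1, "utilization": 0.50},
]


def on_time_probability(applicant):
    """Toy logistic score; the intercept and weights are assumed, not fitted."""
    z = 2.0 - 0.8 * applicant["late_payments"] - 2.0 * applicant["utilization"]
    return 1.0 / (1.0 + math.exp(-z))


# Rank-order: most likely to pay on time first.
ranked = sorted(applicants, key=on_time_probability, reverse=True)
for applicant in ranked:
    print(applicant["name"], round(on_time_probability(applicant), 3))
```

In practice the weights would be estimated from historical repayment data rather than chosen by hand.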
Predictive analytics is also used in insurance, telecommunications, retail, travel, healthcare, pharmaceuticals and other fields.
Types of predictive analytics
Generally, predictive analytics is used to mean predictive modeling, scoring of predictive models, and forecasting.
However, people are increasingly using the term to describe related analytic disciplines, such as descriptive modeling and decision modeling or optimization.
These disciplines also involve rigorous data analysis, and are widely used in business for segmentation and decision making, but have different purposes and the statistical techniques underlying them vary.
Predictive models
Predictive models analyze past performance to assess how likely a customer is to exhibit a specific behavior in the future in order to improve marketing effectiveness.
This category also encompasses models that seek out subtle data patterns to answer questions about customer performance, such as fraud detection models.
Predictive models often perform calculations during live transactions, for example, to evaluate the risk or opportunity of a given customer or transaction, in order to guide a decision.
Descriptive models
Descriptive models “describe” relationships in data in a way that is often used to classify customers or prospects into groups.
Unlike predictive models that focus on predicting a single customer behavior (such as credit risk), descriptive models identify many different relationships between customers or products.
But the descriptive models do not rank-order customers by their likelihood of taking a particular action the way predictive models do.
Descriptive models are often used “offline,” for example, to categorize customers by their product preferences and life stage.
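The offline categorization described above might look like the following sketch, which groups customers into segments by product preference and life stage without ranking them. The customer records and field names are hypothetical.

```python
from collections import defaultdict

# Hypothetical customer records; the field names are assumptions.
customers = [
    {"id": 1, "preferred_product": "savings", "life_stage": "student"},
    {"id": 2, "preferred_product": "mortgage", "life_stage": "family"},
    {"id": 3, "preferred_product": "savings", "life_stage": "student"},
    {"id": 4, "preferred_product": "savings", "life_stage": "retired"},
]

# Group customers into segments; no ordering by likelihood is implied.
segments = defaultdict(list)
for customer in customers:
    key = (customer["preferred_product"], customer["life_stage"])
    segments[key].append(customer["id"])

for key, ids in sorted(segments.items()):
    print(key, ids)
```

Unlike a scoring model, the result is a set of groups rather than an ordering of individual customers.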
Descriptive modeling tools can be used to develop agent-based models that simulate a large number of individualized agents to predict possible futures.
Decision models
Decision models describe the relationship between all the elements of a decision — the known data (including results of predictive models), the decision and the forecast results of the decision — in order to predict the results of decisions involving many variables.
These models can be used in optimization, a data-driven approach to improving decision logic that involves maximizing certain outcomes while minimizing others.
Decision models are generally used offline, to develop decision logic or a set of business rules that will produce the desired action for every customer or circumstance.
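A very small decision model can be sketched as choosing, for each customer, the action with the highest expected value given a predicted response probability. The actions, costs, revenue and effectiveness factors below are invented for illustration; a real decision model would typically involve many more variables and constraints.

```python
# Hypothetical actions: cost per contact, revenue if the customer responds,
# and an assumed effectiveness factor applied to the predicted probability.
ACTIONS = {
    "no_offer": {"cost": 0.0, "revenue": 0.0, "effectiveness": 0.0},
    "email": {"cost": 0.5, "revenue": 40.0, "effectiveness": 0.5},
    "call": {"cost": 8.0, "revenue": 40.0, "effectiveness": 1.0},
}


def expected_value(action, p_respond):
    """Expected profit of an action for a customer with response probability p_respond."""
    spec = ACTIONS[action]
    return p_respond * spec["effectiveness"] * spec["revenue"] - spec["cost"]


def best_action(p_respond):
    """Decision rule: pick the action that maximizes expected profit."""
    return max(ACTIONS, key=lambda action: expected_value(action, p_respond))


for p in (0.01, 0.05, 0.60):
    print(p, best_action(p))
```

Evaluating the rule offline over a range of probabilities yields a table of thresholds that can then be deployed as a business rule.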
Predictive analytics
Definition
Predictive analytics is an area of statistical analysis that deals with extracting information from data and using it to predict future trends and behavior patterns.
The core of predictive analytics relies on capturing relationships between explanatory variables and the predicted variables from past occurrences, and exploiting those relationships to predict future outcomes.
Current uses
Although predictive analytics can be put to use in many applications, we outline a few examples where predictive analytics has shown positive impact in recent years.
Analytical Customer Relationship Management (CRM)
Analytical Customer Relationship Management is a frequent commercial application of Predictive Analysis.
Methods of predictive analysis are applied to customer data to pursue CRM objectives.
Direct marketing
Product marketing is constantly faced with the challenge of coping with the increasing number of competing products, different consumer preferences and the variety of methods (channels) available to interact with each consumer.
Efficient marketing is a process of understanding the amount of variability and tailoring the marketing strategy for greater profitability.
Predictive analytics can help identify consumers with a higher likelihood of responding to a particular marketing offer.
Models can be built using data from consumers’ past purchasing history and past response rates for each channel.
Additional information about the consumers’ demographic, geographic and other characteristics can be used to make more accurate predictions.
Targeting only these consumers can lead to a substantial increase in response rate, which in turn can lead to a significant reduction in cost per acquisition.
Apart from identifying prospects, predictive analytics can also help to identify the most effective combination of products and marketing channels that should be used to target a given consumer.
Cross-sell
Often corporate organizations collect and maintain abundant data (e.g. customer records, sale transactions) and exploiting hidden relationships in the data can provide a competitive advantage to the organization.
For an organization that offers multiple products, an analysis of existing customer behavior can lead to efficient cross sell of products.
This directly leads to higher profitability per customer and strengthening of the customer relationship.
Predictive analytics can help analyze customers’ spending, usage and other behavior, and help cross-sell the right product at the right time.
Customer retention
With the amount of competing services available, businesses need to focus efforts on maintaining continuous consumer satisfaction.
In such a competitive scenario, consumer loyalty needs to be rewarded and customer attrition needs to be minimized.
Businesses tend to respond to customer attrition on a reactive basis, acting only after the customer has initiated the process to terminate service.
At this stage, changing the customer’s decision is almost impossible.
Proper application of predictive analytics can lead to a more proactive retention strategy.
By a frequent examination of a customer’s past service usage, service performance, spending and other behavior patterns, predictive models can determine the likelihood of a customer wanting to terminate service sometime in the near future.
An intervention with lucrative offers can increase the chance of retaining the customer.
Silent attrition is the behavior of a customer to slowly but steadily reduce usage and is another problem faced by many companies.
Predictive analytics can also predict this behavior accurately and before it occurs, so that the company can take proper actions to increase customer activity.
Underwriting
Many businesses have to account for risk exposure due to their different services and determine the cost needed to cover the risk.
For example, auto insurance providers need to accurately determine the amount of premium to charge to cover each automobile and driver.
A financial company needs to assess a borrower’s potential and ability to pay before granting a loan.
For a health insurance provider, predictive analytics can analyze a few years of past medical claims data, as well as lab, pharmacy and other records where available, to predict how expensive an enrollee is likely to be in the future.
Predictive analytics can help underwriting of these quantities by predicting the chances of illness, default, bankruptcy, etc.
Predictive analytics can streamline the process of customer acquisition, by predicting the future risk behavior of a customer using application level data.
Proper predictive analytics can lead to proper pricing decisions, which can help mitigate future risk of default.
Collection analytics
Every portfolio has a set of delinquent customers who do not make their payments on time.
The financial institution has to undertake collection activities on these customers to recover the amounts due.
A lot of collection resources are wasted on customers who are difficult or impossible to recover.
Predictive analytics can help optimize the allocation of collection resources by identifying the most effective collection agencies, contact strategies, legal actions and other strategies for each customer, thus significantly increasing recovery while at the same time reducing collection costs.
Fraud detection
Fraud is a big problem for many businesses and can be of various types.
Inaccurate credit applications, fraudulent transactions, identity thefts and false insurance claims are some examples of this problem.
These problems plague firms all across the spectrum and some examples of likely victims are credit card issuers, insurance companies, retail merchants, manufacturers, business to business suppliers and even services providers.
This is an area where a predictive model is often used to help weed out the “bads” and reduce a business's exposure to fraud.
Portfolio, product or economy level prediction
Often the focus of analysis is not the consumer but the product, portfolio, firm, industry or even the economy.
For example a retailer might be interested in predicting store level demand for inventory management purposes.
Or the Federal Reserve Board might be interested in predicting the unemployment rate for the next year.
These types of problems can be addressed by predictive analytics using time series techniques (see below).
Statistical techniques
The approaches and techniques used to conduct predictive analytics can broadly be grouped into regression techniques and machine learning techniques.
Regression Techniques
Regression models are the mainstay of predictive analytics.
The focus lies on establishing a mathematical equation as a model to represent the interactions between the different variables in consideration.
Depending on the situation, there is a wide variety of models that can be applied while performing predictive analytics.
Some of them are briefly discussed below.
Linear Regression Model
The linear regression model analyzes the relationship between the response or dependent variable and a set of independent or predictor variables.
This relationship is expressed as an equation that predicts the response variable as a linear function of the parameters.
These parameters are adjusted so that a measure of fit is optimized.
Much of the effort in model fitting is focused on minimizing the size of the residual, as well as ensuring that it is randomly distributed with respect to the model predictions.
The goal of regression is to select the parameters of the model so as to minimize the sum of the squared residuals.
This is referred to as ordinary least squares (OLS) estimation and results in best linear unbiased estimates (BLUE) of the parameters if the Gauss-Markov assumptions are satisfied.
Once the model has been estimated we would be interested to know if the predictor variables belong in the model – i.e. is the estimate of each variable’s contribution reliable?
To do this we can check the statistical significance of the model’s coefficients which can be measured using the t-statistic.
This amounts to testing whether the coefficient is significantly different from zero.
How well the model predicts the dependent variable based on the value of the independent variables can be assessed by using the R² statistic.
It measures the predictive power of the model, i.e. the proportion of the total variation in the dependent variable that is “explained” (accounted for) by variation in the independent variables.
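The quantities described above (the least squares estimates, the t-statistic of a coefficient and R²) can be illustrated for the simplest case of a single predictor. The following is a minimal sketch in Python, not a full regression package; the function name ols_simple and the sample data are hypothetical and chosen only for illustration.

```python
import math

def ols_simple(x, y):
    """Fit y = a + b*x by ordinary least squares.

    Returns the intercept a, the slope b, the t-statistic of b and R^2.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx                 # slope that minimizes the squared residuals
    a = mean_y - b * mean_x       # intercept
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    ss_res = sum(r * r for r in residuals)          # sum of squared residuals
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)    # total variation in y
    # Standard error of the slope, using n - 2 residual degrees of freedom
    se_b = math.sqrt(ss_res / (n - 2) / sxx)
    t_b = b / se_b                # tests whether b differs from zero
    r2 = 1.0 - ss_res / ss_tot    # share of variation accounted for
    return a, b, t_b, r2

# Hypothetical data, roughly following y = 2x
a, b, t_b, r2 = ols_simple([1.0, 2.0, 3.0, 4.0, 5.0],
                           [2.1, 3.9, 6.2, 7.8, 10.1])
```

A large absolute t-statistic suggests that the slope is reliably different from zero, and an R² close to 1 indicates that most of the variation in the response is accounted for by the predictor.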
Discrete choice models
Multivariate regression (above) is generally used when the response variable is continuous and has an unbounded range.
Often the response variable may not be continuous but rather discrete.
While mathematically it is feasible to apply multivariate regression to discrete ordered dependent variables, some of the assumptions behind the theory of multivariate linear regression no longer hold, and there are other techniques such as discrete choice models which are better suited for this type of analysis.
If the dependent variable is discrete, some of those superior methods are logistic regression, multinomial logit and probit models.
Logistic regression and probit models are used when the dependent variable is binary.
Logistic regression
In a classification setting, outcome probabilities can be assigned to observations using a logistic model, which transforms information about the binary dependent variable into an unbounded continuous variable (the log-odds) and estimates a regular multivariate model (see Allison’s Logistic Regression for more information on the theory of logistic regression).
The Wald and likelihood-ratio tests are used to test the statistical significance of each coefficient b in the model (analogous to the t tests used in OLS regression; see above).
A test assessing the goodness-of-fit of a classification model is the Hosmer and Lemeshow test.
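As an illustration of the logistic model, the sketch below fits a one-predictor logistic regression by plain gradient ascent on the log-likelihood. This is only one of several possible fitting procedures and is not necessarily what statistical packages use; the function names, learning rate, step count and data are assumptions made for the example.

```python
import math

def sigmoid(z):
    """Map an unbounded value to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Inverse of the sigmoid: the log-odds of a probability."""
    return math.log(p / (1.0 - p))

def log_likelihood(b0, b1, xs, ys):
    """Log-likelihood of binary outcomes ys under the model p = sigmoid(b0 + b1*x)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(b0 + b1 * x)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

def fit_logistic(xs, ys, rate=0.1, steps=5000):
    """Estimate b0 and b1 by gradient ascent on the average log-likelihood."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [y - sigmoid(b0 + b1 * x) for x, y in zip(xs, ys)]
        b0 += rate * sum(errors) / n
        b1 += rate * sum(e * x for e, x in zip(errors, xs)) / n
    return b0, b1

# Hypothetical data: the outcome becomes more likely as x grows
xs = [0, 1, 2, 3, 4, 5]
ys = [0, 0, 1, 0, 1, 1]
b0, b1 = fit_logistic(xs, ys)
probability_at_5 = sigmoid(b0 + b1 * 5)
```

The fitted model is linear in the log-odds, so logit(sigmoid(z)) recovers z, while the predicted probabilities always stay between 0 and 1.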
Multinomial logistic regression
An extension of the binary logit model to cases where the dependent variable has more than 2 categories is the multinomial logit model.
In such cases collapsing the data into two categories might not make good sense or may lead to loss in the richness of the data.
The multinomial logit model is the appropriate technique in these cases, especially when the dependent variable categories are not ordered (for example, colors like red, blue and green).
Some authors have extended multinomial regression to include feature selection/importance methods such as Random multinomial logit.
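A minimal sketch of how a multinomial logit model turns linear scores into category probabilities is given below. It assumes one predictor, a reference category whose coefficients are fixed at zero, and made-up coefficients for the other categories; none of these values come from a fitted model.

```python
import math

def softmax(scores):
    """Convert unbounded scores into probabilities that sum to 1."""
    top = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def multinomial_logit(x, coefficients):
    """Return one probability per category.

    coefficients holds an (intercept, slope) pair per category;
    the reference category uses (0, 0).
    """
    scores = [b0 + b1 * x for b0, b1 in coefficients]
    return softmax(scores)

# Hypothetical coefficients for three unordered categories: red, blue, green
coefficients = [(0.0, 0.0), (0.5, -1.0), (-0.2, 0.8)]
probabilities = multinomial_logit(1.0, coefficients)
```

With only two categories the same computation reduces to the binary logit model, which is why the multinomial model is described as its extension.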
Probit regression
Probit models offer an alternative to logistic regression for modeling categorical dependent variables.
Even though the outcomes tend to be similar, the underlying distributions are different.
Probit models are popular in social sciences like economics.
A good way to understand the key difference between probit and logit models is to assume that there is a latent variable z.
We do not observe z but instead observe y which takes the value 0 or 1.
In the logit model we assume that z follows a logistic distribution.
In the probit model we assume that z follows a standard normal distribution.
Note that in social sciences (for example, economics), probit is often used to model situations where the observed variable y is continuous but takes values between 0 and 1.
Logit vs. Probit
The Probit model has been around longer than the logit model.
Their distribution curves look nearly identical, except that the logistic distribution has slightly heavier (flatter) tails.
In fact one of the reasons the logit model was formulated was that the probit model was extremely hard to compute because it involved calculating difficult integrals.
Modern computing however has made this computation fairly simple.
The coefficients obtained from the logit and probit model are also fairly close.
However the odds ratio makes the logit model easier to interpret.
For practical purposes the only reasons for choosing the probit model over the logistic model would be:
There is a strong belief that the underlying distribution is normal
The actual event is not a binary outcome (e.g. Bankrupt/not bankrupt) but a proportion (e.g. Proportion of population at different debt levels).
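The similarity of the two models can be checked numerically by comparing the cumulative distribution functions that link the latent variable to the outcome probability. The sketch below uses only the Python standard library; rescaling the logistic distribution to unit variance is a choice made here so that the two curves are comparable.

```python
import math

def logistic_cdf(z):
    """Cumulative distribution function of the standard logistic distribution."""
    return 1.0 / (1.0 + math.exp(-z))

def normal_cdf(z):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def scaled_logistic_cdf(z):
    """Logistic CDF rescaled to unit variance.

    The standard logistic distribution has variance pi^2 / 3.
    """
    return logistic_cdf(z * math.pi / math.sqrt(3.0))

# Largest gap between the two curves on a grid from -4 to 4
largest_gap = max(abs(scaled_logistic_cdf(i / 10) - normal_cdf(i / 10))
                  for i in range(-40, 41))

# Probability mass below -4 under each distribution
logistic_tail = scaled_logistic_cdf(-4.0)
normal_tail = normal_cdf(-4.0)
```

On this grid the two curves differ by only a few percentage points at most, while far from the center the logistic distribution retains noticeably more probability mass than the normal distribution.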
Time series models
Time series models are used for predicting or forecasting the future behavior of variables.
These models account for the fact that data points taken over time may have an internal structure (such as autocorrelation, trend or seasonal variation) that should be accounted for.
As a result standard regression techniques cannot be applied to time series data and methodology has been developed to decompose the trend, seasonal and cyclical component of the series.
Modeling the dynamic path of a variable can improve forecasts since the predictable component of the series can be projected into the future.
Time series models estimate difference equations containing stochastic components.
Two commonly used forms of these models are autoregressive models (AR) and moving average (MA) models.
The Box-Jenkins methodology (1976) developed by George Box and G.M. Jenkins combines the AR and MA models to produce the ARMA (autoregressive moving average) model which is the cornerstone of stationary time series analysis.
ARIMA (autoregressive integrated moving average) models, on the other hand, are used to describe non-stationary time series.
Box and Jenkins suggest differencing a non-stationary time series to obtain a stationary series to which an ARMA model can be applied.
Non-stationary time series have a pronounced trend and do not have a constant long-run mean or variance.
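Differencing and the simplest autoregressive model can be sketched in a few lines. The example below is illustrative only: it removes a linear trend by first differencing and estimates a zero-mean AR(1) coefficient by least squares, which is a simplification of the estimation procedures used in the Box-Jenkins methodology. The function names and data are hypothetical.

```python
def difference(series):
    """First difference: the change from each observation to the next."""
    return [later - earlier for earlier, later in zip(series, series[1:])]

def fit_ar1(series):
    """Least squares estimate of phi in x[t] = phi * x[t-1] + noise.

    Assumes the series has mean zero, so no intercept is estimated.
    """
    numerator = sum(cur * prev for prev, cur in zip(series, series[1:]))
    denominator = sum(prev * prev for prev in series[:-1])
    return numerator / denominator

# A series with a pronounced linear trend is not stationary ...
trending = [1, 3, 5, 7, 9]
# ... but its first difference is constant
stationary = difference(trending)

# A noise-free AR(1) series in which each value is half the previous one
phi = fit_ar1([8.0, 4.0, 2.0, 1.0, 0.5])
```

Here the differenced series equals [2, 2, 2, 2], and the estimated coefficient phi is 0.5, the value used to build the second series.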
Box and Jenkins proposed a three stage methodology which includes: model identification, estimation and validation.
The identification stage involves identifying if the series is stationary or not and the presence of seasonality by examining plots of the series, autocorrelation and partial autocorrelation functions.
In the estimation stage, models are estimated using non-linear time series or maximum likelihood estimation procedures.
Finally the validation stage involves diagnostic checking such as plotting the residuals to detect outliers and evidence of model fit.
In recent years time series models have become more sophisticated and attempt to model conditional heteroskedasticity with models such as ARCH (autoregressive conditional heteroskedasticity) and GARCH (generalized autoregressive conditional heteroskedasticity) models frequently used for financial time series.
In addition time series models are also used to understand inter-relationships among economic variables represented by systems of equations using VAR (vector autoregression) and structural VAR models.
Survival or duration analysis
Survival analysis is another name for time to event analysis.
These techniques were primarily developed in the medical and biological sciences, but they are also widely used in the social sciences like economics, as well as in engineering (reliability and failure time analysis).
Censoring and non-normality which are characteristic of survival data generate difficulty when trying to analyze the data using conventional statistical models such as multiple linear regression.
The Normal distribution, being a symmetric distribution, takes positive as well as negative values, but duration by its very nature cannot be negative and therefore normality cannot be assumed when dealing with duration/survival data.
Hence the normality assumption of regression models is violated.
A censored observation is defined as an observation with incomplete information.
Censoring introduces distortions into traditional statistical methods and is essentially a defect of the sample data.
The assumption is that if the data were not censored it would be representative of the population of interest.
In survival analysis, censored observations arise whenever the dependent variable of interest represents the time to a terminal event, and the duration of the study is limited in time.
An important concept in survival analysis is the hazard rate.
The hazard rate is defined as the probability that the event will occur at time t conditional on surviving until time t.
Another concept related to the hazard rate is the survival function which can be defined as the probability of surviving to time t.
Most models try to model the hazard rate by choosing the underlying distribution depending on the shape of the hazard function.
A distribution whose hazard function slopes upward is said to have positive duration dependence, a decreasing hazard shows negative duration dependence whereas constant hazard is a process with no memory usually characterized by the exponential distribution.
Some of the distributional choices in survival models are: F, gamma, Weibull, log normal, inverse normal, exponential etc.
All these distributions are for a non-negative random variable.
Duration models can be parametric, non-parametric or semi-parametric.
Some of the models commonly used are the Kaplan-Meier estimator (non-parametric) and the Cox proportional hazards model (semi-parametric).
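The way censored observations enter a survival estimate can be illustrated with the Kaplan-Meier estimator of the survival function. The following is a minimal sketch; the function name and the sample durations are hypothetical.

```python
def kaplan_meier(durations, observed):
    """Kaplan-Meier estimate of the survival function.

    durations: time until the event or until censoring for each subject
    observed:  True if the event was seen, False if the subject was censored
    Returns a list of (time, estimated probability of surviving past time).
    """
    event_times = sorted({t for t, seen in zip(durations, observed) if seen})
    survival = 1.0
    curve = []
    for t in event_times:
        # Subjects still under observation just before time t
        at_risk = sum(1 for d in durations if d >= t)
        events = sum(1 for d, seen in zip(durations, observed) if d == t and seen)
        survival *= 1.0 - events / at_risk
        curve.append((t, survival))
    return curve

# Hypothetical study: five subjects, two of them censored (at times 3 and 8)
curve = kaplan_meier([2, 3, 3, 5, 8], [True, True, False, True, False])
```

Censored subjects never count as events, but they remain in the risk set until their censoring time, so the incomplete information they carry is still used.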
Classification and regression trees
Classification and regression trees (CART) is a non-parametric technique that produces either classification or regression trees, depending on whether the dependent variable is categorical or numeric, respectively.
Trees are formed by a collection of rules based on values of certain variables in the modeling data set
Rules are selected based on how well splits based on variables’ values can differentiate observations based on the dependent variable
Once a rule is selected and splits a node into two, the same logic is applied to each “child” node (i.e. it is a recursive procedure)
Splitting stops when CART detects no further gain can be made, or some pre-set stopping rules are met
Each branch of the tree ends in a terminal node
Each observation falls into one and exactly one terminal node
Each terminal node is uniquely defined by a set of rules
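The rule-selection step listed above can be sketched for a single numeric variable and a categorical dependent variable. The example uses Gini impurity to score candidate splits, which is one common criterion and an assumption of this sketch; the function names and data are hypothetical.

```python
def gini(labels):
    """Gini impurity: 0 when all labels in the node agree."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(values, labels):
    """Find the threshold on one variable that best separates the labels.

    Returns (threshold, weighted impurity of the two child nodes),
    or None when the variable takes a single value and cannot be split.
    """
    best = None
    distinct = sorted(set(values))
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [lab for v, lab in zip(values, labels) if v <= threshold]
        right = [lab for v, lab in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical node: low values belong to class 'a', high values to class 'b'
rule = best_split([1, 2, 3, 10, 11, 12], ['a', 'a', 'a', 'b', 'b', 'b'])
```

A full tree would apply best_split to every candidate variable, keep the best rule, and then repeat the search within each child node until a stopping rule is met.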
A very popular method for predictive analytics is Leo Breiman's Random forests or derived versions of this technique like Random multinomial logit.
Multivariate adaptive regression splines
Multivariate adaptive regression splines (MARS) is a non-parametric technique that builds flexible models by fitting piecewise linear regressions.
An important concept associated with regression splines is that of a knot.
A knot is where one local regression model gives way to another and thus is the point of intersection between two splines.
In multivariate adaptive regression splines, basis functions are the tool used for generalizing the search for knots.
Basis functions are a set of functions used to represent the information contained in one or more variables.
The multivariate adaptive regression splines model almost always creates the basis functions in pairs.
The multivariate adaptive regression splines approach deliberately overfits the model and then prunes it to arrive at the optimal model.
The algorithm is computationally very intensive and in practice we are required to specify an upper limit on the number of basis functions.
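The paired basis functions can be illustrated with hinge functions placed on either side of a knot. The sketch below only evaluates a model whose knots and coefficients are already given; the search for knots and the pruning step are not shown, and all names and numbers are hypothetical.

```python
def hinge_pair(x, knot):
    """The two mirrored basis functions created at a knot."""
    return max(0.0, x - knot), max(0.0, knot - x)

def mars_predict(x, intercept, terms):
    """Evaluate a piecewise linear model built from hinge pairs.

    terms is a list of (knot, coefficient above knot, coefficient below knot).
    """
    prediction = intercept
    for knot, coef_above, coef_below in terms:
        above, below = hinge_pair(x, knot)
        prediction += coef_above * above + coef_below * below
    return prediction

# Hypothetical model with one knot at x = 3:
# slope 2 to the right of the knot, slope -0.5 to the left
terms = [(3.0, 2.0, 0.5)]
at_knot = mars_predict(3.0, 1.0, terms)
right_of_knot = mars_predict(5.0, 1.0, terms)
left_of_knot = mars_predict(1.0, 1.0, terms)
```

At the knot both hinge functions are zero, so the two local linear pieces meet there and the fitted function stays continuous.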
Machine learning techniques
Machine learning, a branch of artificial intelligence, was originally employed to develop techniques to enable computers to learn.
Today, since it includes a number of advanced statistical methods for regression and classification, it finds application in a wide variety of fields including medical diagnostics, credit card fraud detection, face and speech recognition and analysis of the stock market.
In certain applications it is sufficient to directly predict the dependent variable without focusing on the underlying relationships between variables.
In other cases, the underlying relationships can be very complex and the mathematical form of the dependencies unknown.
For such cases, machine learning techniques emulate human cognition and learn from training examples to predict future events.
A brief discussion of some of these methods used commonly for predictive analytics is provided below.
A detailed study of machine learning can be found in Mitchell (1997).
Neural networks
Neural networks are nonlinear sophisticated modeling techniques that are able to model complex functions.
They can be applied to problems of prediction, classification or control in a wide spectrum of fields such as finance, cognitive psychology/neuroscience, medicine, engineering, and physics.
Neural networks are used when the exact nature of the relationship between inputs and output is not known.
A key feature of neural networks is that they learn the relationship between inputs and output through training.
There are two types of training in neural networks used by different networks, supervised and unsupervised training, with supervised being the most common one.
Some examples of neural network training techniques are backpropagation, quick propagation, conjugate gradient descent, projection operator, Delta-Bar-Delta etc.
These are applied to network architectures such as multilayer perceptrons, Kohonen networks, Hopfield networks, etc.
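Supervised training by backpropagation can be sketched for a very small multilayer perceptron with two inputs, a sigmoid hidden layer and one sigmoid output. The network size, the squared-error loss, the learning rate and the starting weights are all assumptions made for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w_hidden, w_out, x):
    """w_hidden: [w1, w2, bias] per hidden unit; w_out: one weight per hidden unit, then a bias."""
    hidden = [sigmoid(w[0] * x[0] + w[1] * x[1] + w[2]) for w in w_hidden]
    out = sigmoid(sum(v * h for v, h in zip(w_out, hidden)) + w_out[-1])
    return hidden, out

def loss(w_hidden, w_out, x, target):
    """Squared error for a single training example."""
    _, out = forward(w_hidden, w_out, x)
    return 0.5 * (out - target) ** 2

def backprop(w_hidden, w_out, x, target):
    """Gradient of the loss with respect to every weight, via the chain rule."""
    hidden, out = forward(w_hidden, w_out, x)
    delta_out = (out - target) * out * (1 - out)
    grad_out = [delta_out * h for h in hidden] + [delta_out]
    grad_hidden = []
    for h, v in zip(hidden, w_out):
        # Error signal propagated back from the output to this hidden unit
        delta_h = delta_out * v * h * (1 - h)
        grad_hidden.append([delta_h * x[0], delta_h * x[1], delta_h])
    return grad_hidden, grad_out

def gradient_step(w_hidden, w_out, x, target, rate=0.5):
    """Move every weight a small distance against its gradient."""
    grad_hidden, grad_out = backprop(w_hidden, w_out, x, target)
    new_hidden = [[w - rate * g for w, g in zip(ws, gs)]
                  for ws, gs in zip(w_hidden, grad_hidden)]
    new_out = [w - rate * g for w, g in zip(w_out, grad_out)]
    return new_hidden, new_out

# Hypothetical starting weights and a single training example
w_hidden = [[0.5, -0.4, 0.1], [0.3, 0.8, -0.2]]
w_out = [0.7, -0.6, 0.05]
w_hidden, w_out = gradient_step(w_hidden, w_out, [1.0, 2.0], 1.0)
```

Training repeats such steps over many examples; the other techniques named above differ mainly in how the step direction and size are chosen.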
Radial basis functions
A radial basis function (RBF) is a function which has built into it a distance criterion with respect to a center.
Such functions can be used very efficiently for interpolation and for smoothing of data.
Radial basis functions have been applied in the area of neural networks where they are used as a replacement for the sigmoidal transfer function.
Such networks have 3 layers, the input layer, the hidden layer with the RBF non-linearity and a linear output layer.
The most popular choice for the non-linearity is the Gaussian.
RBF networks have the advantage of not being locked into local minima, as feed-forward networks such as the multilayer perceptron can be.
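The three-layer structure can be sketched for one-dimensional interpolation. In this example every training point serves as the center of one Gaussian hidden unit, and the weights of the linear output layer are found by solving a linear system; the choice of centers, the width of 1.0 and the data are assumptions of the sketch.

```python
import math

def gaussian_rbf(x, center, width=1.0):
    """Gaussian radial basis function: depends only on the distance to the center."""
    return math.exp(-((x - center) ** 2) / (2 * width ** 2))

def solve_linear(a, b):
    """Solve the linear system a * w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        w[r] = (m[r][n] - sum(m[r][c] * w[c] for c in range(r + 1, n))) / m[r][r]
    return w

def fit_rbf(xs, ys, width=1.0):
    """Output-layer weights for a network with one hidden unit per training point."""
    hidden = [[gaussian_rbf(x, c, width) for c in xs] for x in xs]
    return solve_linear(hidden, ys)

def predict_rbf(x, centers, weights, width=1.0):
    """Linear output layer applied to the hidden-layer activations."""
    return sum(w * gaussian_rbf(x, c, width) for w, c in zip(weights, centers))

# Hypothetical data to interpolate
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 2.0, 5.0]
weights = fit_rbf(xs, ys)
between = predict_rbf(1.5, xs, weights)
```

Because the output layer is linear, its weights are obtained by solving a linear system rather than by iterative search, which is one reason the output layer of such a network cannot become stuck in a local minimum.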
Support vector machines
Support Vector Machines (SVM) are used to detect and exploit complex patterns in data by clustering, classifying and ranking the data.
They are learning machines that are used to perform binary classifications and regression estimations.
They commonly use kernel based methods to apply linear classification techniques to non-linear classification problems.
There are a number of types of SVM, commonly distinguished by the kernel they use, such as linear, polynomial and sigmoid.
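The idea behind kernel-based methods can be shown without training a classifier: a kernel computes the inner product that a linear method would see after the data had been mapped into a richer feature space. The sketch below checks this for a degree-2 polynomial kernel on two-dimensional points; the function names are hypothetical, and the example does not implement the SVM optimization itself.

```python
import math

def linear_kernel(x, y):
    """Ordinary inner product of two points."""
    return sum(a * b for a, b in zip(x, y))

def polynomial_kernel(x, y, degree=2):
    """Homogeneous polynomial kernel: the inner product raised to a power."""
    return linear_kernel(x, y) ** degree

def feature_map(x):
    """Explicit feature space that corresponds to the degree-2 kernel in two dimensions."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

x, y = (1.0, 2.0), (3.0, 4.0)
via_kernel = polynomial_kernel(x, y)
via_features = linear_kernel(feature_map(x), feature_map(y))
```

Both routes give the same value (121 for these points, up to rounding), but the kernel never constructs the expanded features, which is what makes linear techniques practical on non-linear problems.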
Naïve Bayes
Naïve Bayes, based on Bayes' conditional probability rule, is used for performing classification tasks.
Naïve Bayes assumes the predictors are statistically independent given the class, which makes it an effective classification tool that is easy to interpret.
It is best employed when faced with the problem of ‘curse of dimensionality’ i.e. when the number of predictors is very high.
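A minimal sketch of a naïve Bayes classifier for categorical predictors follows. It multiplies the class prior by one conditional probability per predictor, which is exactly where the independence assumption is used. The add-one smoothing, the function names and the tiny data set are assumptions of this example.

```python
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Count how often each predictor value occurs within each class."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (class, predictor index) -> value counts
    seen_values = defaultdict(set)        # predictor index -> distinct values
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(label, i)][value] += 1
            seen_values[i].add(value)
    return class_counts, value_counts, seen_values

def predict_nb(model, row):
    """Return the posterior probability of each class for one observation."""
    class_counts, value_counts, seen_values = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total   # prior probability of the class
        for i, value in enumerate(row):
            # Add-one smoothing keeps unseen values from giving probability zero
            score *= (value_counts[(label, i)][value] + 1) / (count + len(seen_values[i]))
        scores[label] = score
    norm = sum(scores.values())
    return {label: s / norm for label, s in scores.items()}

# Hypothetical customers: (missed a payment?, balance level) -> credit outcome
rows = [("yes", "high"), ("yes", "low"), ("no", "low"), ("no", "low")]
labels = ["bad", "bad", "good", "good"]
model = train_nb(rows, labels)
posterior = predict_nb(model, ("yes", "high"))
```

Each predictor contributes a single factor, so the number of quantities to estimate grows only linearly with the number of predictors.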
k-nearest neighbours
The nearest neighbour algorithm (KNN) belongs to the class of pattern recognition statistical methods.
The method does not impose a priori any assumptions about the distribution from which the modeling sample is drawn.
It involves a training set with both positive and negative values.
A new sample is classified by calculating the distance to the nearest neighbouring training case.
The sign of that point will determine the classification of the sample.
In the k-nearest neighbour classifier, the k nearest points are considered and the sign of the majority is used to classify the sample.
The performance of the kNN algorithm is influenced by three main factors: (1) the distance measure used to locate the nearest neighbours; (2) the decision rule used to derive a classification from the k-nearest neighbours; and (3) the number of neighbours used to classify the new sample.
It can be proved that, unlike other methods, this method is universally asymptotically convergent, i.e.: as the size of the training set increases, if the observations are iid, regardless of the distribution from which the sample is drawn, the predicted class will converge to the class assignment that minimizes misclassification error.
See Devroye et al.
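The k-nearest neighbour classifier described above can be sketched in a few lines. The example uses Euclidean distance and a simple majority vote, which correspond to the first two factors listed above; the function name and the training data are hypothetical.

```python
import math
from collections import Counter

def knn_classify(train_points, train_labels, query, k=3):
    """Classify query by majority vote among its k nearest training points."""
    by_distance = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training set with negative (-1) and positive (+1) cases
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = [-1, -1, -1, 1, 1, 1]

near_negatives = knn_classify(points, labels, (0.5, 0.5), k=3)
near_positives = knn_classify(points, labels, (5.5, 5.5), k=3)
```

Setting k to 1 gives the plain nearest neighbour rule, in which the sign of the single closest training case decides the classification.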
Popular tools
There are numerous tools available in the marketplace which help with the execution of predictive analytics.
These range from those which need very little user sophistication to those that are designed for the expert practitioner.
The difference between these tools is often in the level of customization and heavy data lifting allowed.
For traditional statistical modeling some of the popular tools are DAP/SAS, S-Plus, PSPP/SPSS and Stata.
For machine learning/data mining type of applications, KnowledgeSEEKER, KnowledgeSTUDIO, Enterprise Miner, GeneXproTools, Viscovery, Clementine, KXEN Analytic Framework, InforSense and Excel Miner are some of the popularly used options.
Classification Tree analysis can be performed using CART software.
SOMine is a predictive analytics tool based on self-organizing maps (SOMs) available from Viscovery Software.
R is a very powerful tool that can be used to perform almost any kind of statistical analysis, and is freely downloadable.
WEKA is a freely available open-source collection of machine learning methods for pattern classification, regression, clustering, and some types of meta-learning, which can be used for predictive analytics.
RapidMiner is another freely available integrated open-source software environment for predictive analytics, data mining, and machine learning fully integrating WEKA and providing an even larger number of methods for predictive analytics.
Recently, in an attempt to provide a standard language for expressing predictive models, the Predictive Model Markup Language (PMML) has been proposed.
Such an XML-based language provides a way for the different tools to define predictive models and to share these between PMML compliant applications.
Several tools already produce or consume PMML documents; these include ADAPA, IBM DB2 Warehouse, CART, SAS Enterprise Miner, and SPSS.
Predictive analytics has also found its way into the IT lexicon, most notably in the area of IT Automation.
Vendors such as Stratavia, with its Data Palette product, offer predictive analytics as part of their automation platform, predicting how resources will behave in the future and automating the environment accordingly.
The widespread use of predictive analytics in industry has led to the proliferation of numerous productized solutions firms.
Some of them are highly specialized (focusing, for example, on fraud detection, automatic sales lead generation or response modeling) in a specific domain (Fair Isaac for credit card scores) or industry vertical (MarketRx in pharmaceuticals).
Others provide predictive analytics services in support of a wide range of business problems across industry verticals (Fifth C).
Predictive analytics competitions are also fairly common and often pit academics against industry practitioners (see, for example, the KDD Cup).
Conclusion
Predictive analytics adds great value to a business's decision-making capabilities by allowing it to formulate smart policies on the basis of predictions of future outcomes.
A broad range of tools and techniques are available for this type of analysis and their selection is determined by the analytical maturity of the firm as well as the specific requirements of the problem being solved.
Education
Predictive analytics is taught at the following institutions:
Ghent University, Belgium: Master of Marketing Analysis, an 8-month advanced master degree taught in English with strong emphasis on applications of predictive analytics in Analytical CRM.
RapidMiner
RapidMiner (formerly YALE (Yet Another Learning Environment)) is an environment for machine learning and data mining experiments.
It allows experiments to be made up of a large number of arbitrarily nestable operators, described in XML files which can easily be created with RapidMiner's graphical user interface.
Applications of RapidMiner cover both research and real-world data mining tasks.
The initial version was developed by the Artificial Intelligence Unit of the University of Dortmund, starting in 2001.
It is distributed under a GNU license, and has been hosted by SourceForge since 2004.
RapidMiner provides more than 400 operators for all main machine learning procedures, including input and output, and data preprocessing and visualization.
It is written in the Java programming language and therefore can work on all popular operating systems.
It also integrates all learning schemes and attribute evaluators of the Weka learning environment.
Properties
Some properties of RapidMiner are:
written in Java
knowledge discovery processes are modeled as operator trees
internal XML representation ensures standardized interchange format of data mining experiments
scripting language allows for automatic large-scale experiments
multi-layered data view concept ensures efficient and transparent data handling
graphical user interface, command line mode (batch mode), and Java API for using RapidMiner from your own programs
plugin and extension mechanisms, several plugins already exist
plotting facility offering a large set of high-dimensional visualization schemes for data and models
applications include text mining, multimedia mining, feature engineering, data stream mining and tracking drifting concepts, development of ensemble methods, and distributed data mining.
Russian language
Russian is the most geographically widespread language of Eurasia, the most widely spoken of the Slavic languages, and the largest native language in Europe.
Russian belongs to the family of Indo-European languages and is one of three (or, according to some authorities, four) living members of the East Slavic languages, the others being Belarusian and Ukrainian (and possibly Rusyn, often considered a dialect of Ukrainian).
It is also spoken in the countries of the Russophone world.
Written examples of Old East Slavonic are attested from the 10th century onwards.
Today Russian is widely used outside Russia.
It is applied as a means of coding and storage of universal knowledge — 60–70% of all world information is published in English and Russian languages.
Over a quarter of the world's scientific literature is published in Russian.
Russian is also a necessary accessory of world communications systems (broadcasts, air- and space communication, etc).
Due to the status of the Soviet Union as a superpower, Russian had great political importance in the 20th century.
Hence, the language is one of the official languages of the United Nations.
Russian distinguishes between consonant phonemes with palatal secondary articulation and those without, the so-called soft and hard sounds.
This distinction is found between pairs of almost all consonants and is one of the most distinguishing features of the language.
Another important aspect is the reduction of unstressed vowels, which is somewhat similar to that of English.
Stress, which is unpredictable, is not normally indicated orthographically.
According to the Institute of Russian Language of the Russian Academy of Sciences, an optional acute accent may, and sometimes should, be used to mark stress.
For example, it is used to distinguish between otherwise identical words, especially when context doesn't make it obvious: замо́к/за́мок (lock/castle), сто́ящий/стоя́щий (worthwhile/standing), чудно́/чу́дно (this is odd/this is marvellous), молоде́ц/мо́лодец (attaboy/fine young man), узна́ю/узнаю́ (I shall learn it/I am learning it), отреза́ть/отре́зать (imperfective/perfective infinitive of "cut"); to indicate the proper pronunciation of uncommon words, especially personal and family names (афе́ра, гу́ру, Гарси́а, Оле́ша, Фе́рми), and to express the stressed word in the sentence (Ты́ съел печенье?/Ты съе́л печенье?/Ты съел пече́нье? - Was it you who ate the cookie?/Did you eat the cookie?/Was the cookie your meal?).
Acute accents are mandatory in lexical dictionaries and books intended to be used either by children or foreign readers.
Classification
Russian is a Slavic language in the Indo-European family.
From the point of view of the spoken language, its closest relatives are Ukrainian and Belarusian, the other two national languages in the East Slavic group.
In many places in eastern Ukraine and Belarus, these languages are spoken interchangeably, and in certain areas traditional bilingualism resulted in language mixture, e.g. Surzhyk in eastern Ukraine and Trasianka in Belarus.
An East Slavic Old Novgorod dialect, although it vanished during the fifteenth or sixteenth century, is sometimes considered to have played a significant role in the formation of the modern Russian language.
The vocabulary (mainly abstract and literary words), principles of word formation, and, to some extent, inflections and literary style of Russian have also been influenced by Church Slavonic, a developed and partly adopted form of the South Slavic Old Church Slavonic language used by the Russian Orthodox Church.
However, the East Slavic forms have tended to be used exclusively in the various dialects that are experiencing a rapid decline.
In some cases, both the East Slavic and the Church Slavonic forms are in use, with slightly different meanings.
For details, see Russian phonology and History of the Russian language.
Russian phonology and syntax (especially in northern dialects) have also been influenced to some extent by the numerous Finnic languages of the Finno-Ugric subfamily: Merya, Moksha, Muromian, the language of the Meshchera, Veps, et cetera.
These languages, some of them now extinct, used to be spoken in the center and in the north of what is now the European part of Russia.
They came in contact with Eastern Slavic as far back as the early Middle Ages and eventually served as substratum for the modern Russian language.
The Russian dialects spoken north, north-east and north-west of Moscow have a considerable number of words of Finno-Ugric origin.
Over the course of centuries, the vocabulary and literary style of Russian have also been influenced by Turkic/Caucasian/Central Asian languages, as well as Western/Central European languages such as Polish, Latin, Dutch, German, French, and English.
According to the Defense Language Institute in Monterey, California, Russian is classified as a level III language in terms of learning difficulty for native English speakers, requiring approximately 780 hours of immersion instruction to achieve intermediate fluency.
It is also regarded by the United States Intelligence Community as a "hard target" language, due both to the difficulty English speakers have in mastering it and to its critical role in American world policy.
Geographic distribution
Russian is primarily spoken in Russia and, to a lesser extent, the other countries that were once constituent republics of the USSR.
Until 1917, it was the sole official language of the Russian Empire.
During the Soviet period, the policy toward the languages of the various other ethnic groups fluctuated in practice.
Though each of the constituent republics had its own official language, the unifying role and superior status was reserved for Russian.
Following the break-up of 1991, several of the newly independent states have encouraged their native languages, which has partly reversed the privileged status of Russian, though its role as the language of post-Soviet national intercourse throughout the region has continued.
In Latvia, notably, its official recognition and legality in the classroom have been a topic of considerable debate in a country where more than one-third of the population is Russian-speaking, consisting mostly of post-World War II immigrants from Russia and other parts of the former USSR (Belarus, Ukraine).
Similarly, in Estonia, the Soviet-era immigrants and their Russian-speaking descendants constitute 25.6% of the country's current population, and 58.6% of the native Estonian population is also able to speak Russian.
In all, 67.8% of Estonia's population can speak Russian.
In Kazakhstan and Kyrgyzstan, Russian remains a co-official language with Kazakh and Kyrgyz respectively.
Large Russian-speaking communities still exist in northern Kazakhstan, and ethnic Russians comprise 25.6% of Kazakhstan's population.
A much smaller Russian-speaking minority in Lithuania represents less than one tenth of the country's overall population.
Nevertheless, more than half of the population of the Baltic states is able to hold a conversation in Russian, and almost all have at least some familiarity with the most basic spoken and written phrases.
The Russian control of Finland in 1809–1918, however, has left few Russian speakers in Finland.
There are 33,400 Russian speakers in Finland, amounting to 0.6% of the population.
Of these, 5,000 (0.1%) are late 19th-century and 20th-century immigrants, and the rest are recent immigrants who arrived in the 1990s and later.
In the twentieth century, Russian was widely taught in the schools of the members of the old Warsaw Pact and in other countries that used to be allies of the USSR.
In particular, these countries include Poland, Bulgaria, the Czech Republic, Slovakia, Hungary, Romania, Albania and Cuba.
However, younger generations are usually not fluent in it, because Russian is no longer mandatory in the school system.
It is currently the most widely taught foreign language in Mongolia.
Russian is also spoken in Israel by at least 750,000 ethnic Jewish immigrants from the former Soviet Union (1999 census).
The Israeli press and websites regularly publish material in Russian.
Sizable Russian-speaking communities also exist in North America, especially in large urban centers of the U.S. and Canada such as New York City, Philadelphia, Boston, Los Angeles, San Francisco, Seattle, Toronto, Baltimore, Miami, Chicago, Denver, and the Cleveland suburb of Richmond Heights.
In the former two, Russian-speaking groups total over half a million.
In a number of locations they publish their own newspapers and live in self-sufficient neighborhoods (especially the generation of immigrants who started arriving in the early sixties).
Only about a quarter of them are ethnic Russians, however.
Before the dissolution of the Soviet Union, the overwhelming majority of Russophones in North America were Russian-speaking Jews.
Afterwards the influx from the countries of the former Soviet Union changed the statistics somewhat.
According to the United States 2000 Census, Russian is the primary language spoken in the homes of over 700,000 individuals living in the United States.
Significant Russian-speaking groups also exist in Western Europe.
These have been fed by several waves of immigrants since the beginning of the twentieth century, each with its own flavor of language.
Germany, the United Kingdom, Spain, France, Italy, Belgium, Greece, Brazil, Norway, Austria, and Turkey have significant Russian-speaking communities totaling 3 million people.
Two thirds of them are actually Russian-speaking descendants of Germans, Greeks, Jews, Armenians, or Ukrainians who either repatriated after the USSR collapsed or are just looking for temporary employment.
Recent estimates of the total number of speakers of Russian:
Official status
Russian is the official language of Russia.
It is also an official language of Belarus, Kazakhstan and Kyrgyzstan, an unofficial but widely spoken language in Ukraine, and the de facto official language of the unrecognized states of Transnistria, South Ossetia and Abkhazia.
Russian is one of the six official languages of the United Nations.
Education in Russian is still a popular choice for both speakers of Russian as a second language (RSL) and native speakers, in Russia as well as in many of the former Soviet republics.
97% of the public school students of Russia, 75% in Belarus, 41% in Kazakhstan, 25% in Ukraine, 23% in Kyrgyzstan, 21% in Moldova, 7% in Azerbaijan, 5% in Georgia and 2% in Armenia and Tajikistan receive their education only or mostly in Russian.
The corresponding percentage of ethnic Russians is lower: 78% in Russia, 10% in Belarus, 26% in Kazakhstan, 17% in Ukraine, 9% in Kyrgyzstan, 6% in Moldova, 2% in Azerbaijan, 1.5% in Georgia and less than 1% in both Armenia and Tajikistan.
Russian-language schooling is also available in Latvia, Estonia and Lithuania, but due to education reforms, the number of subjects taught in Russian is reduced at the high school level.
The language has a co-official status alongside Moldovan in the autonomies of Gagauzia and Transnistria in Moldova, and in seven Romanian communes in Tulcea and Constanţa counties.
In these localities, Russian-speaking Lipovans, who are a recognized ethnic minority, make up more than 20% of the population.
Thus, according to Romania's minority rights law, education, signage, and access to public administration and the justice system are provided in Russian alongside Romanian.
In the Autonomous Republic of Crimea in Ukraine, Russian is an officially recognized language alongside Crimean Tatar, but in practice it is the only language used by the government, making it a de facto official language.
Dialects
Despite leveling after 1900, especially in matters of vocabulary, a number of dialects exist in Russia.
Some linguists divide the dialects of the Russian language into two primary regional groupings, "Northern" and "Southern", with Moscow lying on the zone of transition between the two.
Others divide the language into three groupings, Northern, Central and Southern, with Moscow lying in the Central region.
Dialectology within Russia recognizes dozens of smaller-scale variants.
The dialects often show distinct and non-standard features of pronunciation and intonation, vocabulary, and grammar.
Some of these are relics of ancient usage now completely discarded by the standard language.
The northern Russian dialects and those spoken along the Volga River typically pronounce unstressed /o/ clearly (the phenomenon called okanye/оканье).
East of Moscow, particularly in Ryazan Region, unstressed /e/ and /a/ following palatalized consonants and preceding a stressed syllable are not reduced to [ɪ] (as in the Moscow dialect) but are instead pronounced [a] in such positions (e.g. несли is pronounced [nʲasˈlʲi], not [nʲɪsˈlʲi]); this is called yakanye/яканье. Many southern dialects have a palatalized final /tʲ/ in 3rd person forms of verbs (this is unpalatalized in the standard dialect) and a fricative [ɣ] where the standard dialect has [ɡ].
However, in certain areas south of Moscow, e.g. in and around Tula, /ɡ/ is pronounced as in the Moscow and northern dialects unless it precedes a voiceless plosive or a pause.
In this position /ɡ/ is lenited and devoiced to the fricative [x], e.g. друг [drux] (in the Moscow dialect, only Бог, лёгкий, мягкий and some derivatives follow this rule).
Some of these features (e.g. a debuccalized or lenited /ɡ/ and a palatalized final /tʲ/ in 3rd person forms of verbs) are also present in modern Ukrainian, indicating either a linguistic continuum or strong influence one way or the other.
The city of Veliky Novgorod has historically displayed a feature called chokanye/tsokanye (чоканье/цоканье), where /tɕ/ and /ts/ were confused (this is thought to be due to influence from Finnish, which does not distinguish these sounds).
So, цапля ("heron") has been recorded as 'чапля'.
Also, the second palatalization of velars did not occur there, so the so-called ě² (from the Proto-Slavonic diphthong *ai) did not cause /k, ɡ, x/ to shift to /ts, z, s/; therefore where Standard Russian has цепь ("chain"), the form кепь is attested in earlier texts.
Among the first to study Russian dialects was Lomonosov in the eighteenth century.
In the nineteenth, Vladimir Dal compiled the first dictionary that included dialectal vocabulary.
Detailed mapping of Russian dialects began at the turn of the twentieth century.
In modern times, the monumental Dialectological Atlas of the Russian Language (Диалектологический атлас русского языка) was published in three folio volumes in 1986–1989, after four decades of preparatory work.
The standard language is based on (but not identical to) the Moscow dialect.
Derived languages
Balachka, a dialect spoken primarily by Cossacks in the regions of Don, Kuban and Terek.
Fenya, a criminal argot of ancient origin, with Russian grammar but a distinct vocabulary.
Nadsat, the fictional language spoken in 'A Clockwork Orange', which uses many Russian words and Russian slang.
Surzhyk, a language with Russian and Ukrainian features, spoken in some areas of Ukraine.
Trasianka, a language with Russian and Belarusian features, used by a large portion of the rural population in Belarus.
Quelia, a pseudo-pidgin of German and Russian.
Runglish, a Russian-English pidgin.
This word is also used by English speakers to describe the way in which Russians attempt to speak English using Russian morphology and/or syntax.
Russenorsk is an extinct pidgin language with mostly Russian vocabulary and mostly Norwegian grammar, used for communication between Russians and Norwegian traders in the Pomor trade in Finnmark and the Kola Peninsula.
Writing system
Alphabet
Russian is written using a modified version of the Cyrillic (кириллица) alphabet.
The Russian alphabet consists of 33 letters.
The following table gives their upper case forms, along with IPA values for each letter's typical sound:
Older letters of the Russian alphabet include <ѣ>, which merged to <е> (/e/); <і> and <ѵ>, which both merged to <и> (/i/); <ѳ>, which merged to <ф> (/f/); and <ѧ>, which merged to <я> (/ja/ or /ʲa/).
While these older letters have been abandoned at one time or another, they may be used in this and related articles.
The yers <ъ> and <ь> originally indicated the pronunciation of ultra-short or reduced /ŭ/, /ĭ/.
The Russian alphabet has many systems of character encoding.
KOI8-R was designed by the government and was intended to serve as the standard encoding.
This encoding is still used in UNIX-like operating systems.
Nevertheless, the spread of MS-DOS and Microsoft Windows created chaos and ended by establishing different encodings as de facto standards.
For communication purposes, a number of conversion applications were developed.
"iconv" is an example that is supported by most versions of Linux, Macintosh and some other operating systems.
Most implementations (especially old ones) of the character encoding for the Russian language are aimed at simultaneous use of English and Russian characters only and do not include support for any other language.
Certain hopes for a unification of the character encoding for the Russian alphabet are related to the Unicode standard, specifically designed for peaceful coexistence of various languages, including even dead languages.
Unicode also supports the letters of the Early Cyrillic alphabet, which have many similarities with the Greek alphabet.
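The practical effect of competing encodings can be illustrated with a short sketch. The text above names only KOI8-R and Unicode; the codec names cp866 and cp1251 are added here as the code pages commonly associated with MS-DOS and Windows, and the helper convert is a hypothetical stand-in for what a tool such as iconv does.

```python
text = "Москва"

# The same six letters map to different byte values in each encoding.
for codec in ("koi8_r", "cp866", "cp1251", "utf-8"):
    raw = text.encode(codec)
    print(f"{codec:8} {len(raw):2d} bytes  {raw.hex()}")


def convert(data: bytes, source: str, target: str) -> bytes:
    """Re-encode bytes from one character encoding to another."""
    return data.decode(source).encode(target)


koi8_bytes = text.encode("koi8_r")
windows_bytes = convert(koi8_bytes, "koi8_r", "cp1251")

# Reading KOI8-R bytes as if they were Windows-1251 raises no error here;
# it silently produces the wrong letters.
garbled = koi8_bytes.decode("cp1251")
print(garbled == text)  # False
```

The single-byte encodings each use one byte per letter, while UTF-8 uses two bytes for each Cyrillic letter, so the byte counts printed above differ even though the text is identical.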
Orthography
Russian spelling is reasonably phonemic in practice.
It is in fact a balance among phonemics, morphology, etymology, and grammar; and, like that of most living languages, has its share of inconsistencies and controversial points.
A number of rigid spelling rules introduced between the 1880s and 1910s have been responsible for the latter whilst trying to eliminate the former.
The current spelling follows the major reform of 1918, and the final codification of 1956.
An update proposed in the late 1990s has met a hostile reception, and has not been formally adopted.
The punctuation, originally based on Byzantine Greek, was in the seventeenth and eighteenth centuries reformulated on the French and German models.
Sounds
The phonological system of Russian is inherited from Common Slavonic, but underwent considerable modification in the early historical period, before being largely settled by about 1400.
The language possesses five vowels, which are written with different letters depending on whether or not the preceding consonant is palatalized.
The consonants typically come in plain vs. palatalized pairs, which are traditionally called hard and soft.
(The hard consonants are often velarized, especially before back vowels, although in some dialects the velarization is limited to hard /l/.)
The standard language, based on the Moscow dialect, possesses heavy stress and moderate variation in pitch.
Stressed vowels are somewhat lengthened, while unstressed vowels tend to be reduced to near-close vowels or an unclear schwa.
(See also: vowel reduction in Russian.)
The Russian syllable structure can be quite complex with both initial and final consonant clusters of up to 4 consecutive sounds.
Using a formula with V standing for the nucleus (vowel) and C for each consonant the structure can be described as follows:
(C)(C)(C)(C)V(C)(C)(C)(C)
Clusters of four consonants are not very common, however, especially within a morpheme.
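The syllable template can be checked mechanically. The sketch below is a rough letter-based approximation rather than a phonological analysis: it assumes each vowel letter is one nucleus, each other letter is one consonant, and the hard and soft signs carry no sound of their own. The names cv_skeleton and fits_template are illustrative.

```python
import re

# Vowel letters; ё is written as an escape to avoid normalization issues.
VOWELS = set("аеиоуыэюя") | {"\u0451"}
SIGNS = set("ъь")  # hard and soft signs are skipped
TEMPLATE = re.compile(r"C{0,4}VC{0,4}")  # (C)(C)(C)(C)V(C)(C)(C)(C)


def cv_skeleton(syllable: str) -> str:
    """Map each letter of a syllable to C or V, skipping ъ and ь."""
    return "".join(
        "V" if ch in VOWELS else "C"
        for ch in syllable.lower()
        if ch not in SIGNS
    )


def fits_template(syllable: str) -> bool:
    """True if the syllable has one nucleus and at most 4 consonants per side."""
    return TEMPLATE.fullmatch(cv_skeleton(syllable)) is not None


print(cv_skeleton("взгляд"))    # CCCCVC
print(fits_template("взгляд"))  # True
print(fits_template("встр"))    # False: no vowel nucleus
```

Because letters such as я and ю can also stand for a /j/ plus a vowel, a letter count only approximates the number of sounds in a cluster.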
Consonants
Russian is notable for its distinction based on palatalization of most of the consonants.
While /k/, /ɡ/ and /x/ do have palatalized allophones [kʲ, ɡʲ, xʲ], only /kʲ/ might be considered a phoneme, though it is marginal and generally not considered distinctive (the only native minimal pair that argues for /kʲ/ as a separate phoneme is "это ткёт"/"этот кот").
Palatalization means that the center of the tongue is raised during and after the articulation of the consonant.
In the case of /tʲ/ and /dʲ/, the tongue is raised enough to produce slight frication (affricate sounds).
These sounds: are dental, that is pronounced with the tip of the tongue against the teeth rather than against the alveolar ridge.
Grammar
Russian has preserved an Indo-European synthetic-inflectional structure, although considerable leveling has taken place.
Russian grammar encompasses
a highly synthetic morphology
a syntax that, for the literary language, is the conscious fusion of three elements:
a polished vernacular foundation;
a Church Slavonic inheritance;
a Western European style.
The spoken language has been influenced by the literary one, but continues to preserve characteristic forms.
The dialects show various non-standard grammatical features, some of which are archaisms or descendants of old forms since discarded by the literary language.
Vocabulary
See History of the Russian language for an account of the successive foreign influences on the Russian language.
The total number of words in Russian is difficult to reckon because of the ability to agglutinate and create manifold compounds, diminutives, etc. (see Word Formation under Russian grammar).
The number of listed words or entries in some of the major dictionaries published during the last two centuries, and the total vocabulary of Pushkin (who is credited with greatly augmenting and codifying literary Russian), are as follows:
(As a historical aside, Dahl was, in the second half of the nineteenth century, still insisting that the adjective русский, which was at that time applied uniformly to all the Orthodox Eastern Slavic subjects of the Empire, as well as to its one official language, be spelled руский with one s, in accordance with ancient tradition and what he termed the "spirit of the language".
He was contradicted by the philologist Grot, who distinctly heard the s lengthened or doubled).
Proverbs and sayings
The Russian language is replete with many hundreds of proverbs (пословица) and sayings (поговоркa).
These were already tabulated by the seventeenth century, and collected and studied in the nineteenth and twentieth, with the folk-tales being an especially fertile source.
History and examples
The history of the Russian language may be divided into the following periods.
Kievan period and feudal breakup
The Tatar yoke and the Grand Duchy of Lithuania
The Muscovite period (15th–17th centuries)
Empire (18th–19th centuries)
Soviet period and beyond (20th century)
Judging by the historical records, by approximately 1000 AD the predominant ethnic group over much of modern European Russia, Ukraine, and Belarus was the Eastern branch of the Slavs, speaking a closely related group of dialects.
The political unification of this region into Kievan Rus' in about 880, from which modern Russia, Ukraine and Belarus trace their origins, established Old East Slavic as a literary and commercial language.
It was soon followed by the adoption of Christianity in 988 and the introduction of the South Slavic Old Church Slavonic as the liturgical and official language.
Borrowings and calques from Byzantine Greek began to enter the Old East Slavic and spoken dialects at this time, which in their turn modified the Old Church Slavonic as well.
Dialectal differentiation accelerated after the breakup of Kievan Rus in approximately 1100.
Ruthenian emerged in the territories of modern Belarus and Ukraine, and medieval Russian in modern Russia.
They became definitively distinct in the 13th century, by the time this land was divided between the Grand Duchy of Lithuania in the west and the independent Novgorod Feudal Republic plus small duchies that were vassals of the Tatars in the east.
The official language in Moscow and Novgorod, and later in the growing Moscow Rus’, was Church Slavonic, which evolved from Old Church Slavonic and remained the literary language until the Petrine age, when its usage shrank drastically to biblical and liturgical texts.
Russian developed under the strong influence of Church Slavonic until the close of the seventeenth century; the influence reversed afterwards, leading to corruption of liturgical texts.
The political reforms of Peter the Great were accompanied by a reform of the alphabet, and achieved their goal of secularization and Westernization.
Blocks of specialized vocabulary were adopted from the languages of Western Europe.
By 1800, a significant portion of the gentry spoke French, less often German, on an everyday basis.
Many Russian novels of the 19th century, e.g. Lev Tolstoy’s "War and Peace", contain entire paragraphs and even pages in French with no translation given, on the assumption that educated readers would not need one.
The modern literary language is usually considered to date from the time of Aleksandr Pushkin in the first third of the nineteenth century.
Pushkin revolutionized Russian literature by rejecting archaic grammar and vocabulary (so called "высокий стиль" — "high style") in favor of grammar and vocabulary found in the spoken language of the time.
Even younger modern readers experience only slight difficulty understanding some words in Pushkin’s texts, since only a few words used by Pushkin have become archaic or changed meaning.
On the other hand, many expressions used by Russian writers of the early 19th century, in particular Pushkin, Lermontov, Gogol and Griboyedov, became proverbs or sayings which can frequently be found even in modern Russian colloquial speech.
The political upheavals of the early twentieth century and the wholesale changes of political ideology gave written Russian its modern appearance after the spelling reform of 1918.
Political circumstances and Soviet accomplishments in military, scientific, and technological matters (especially cosmonautics), gave Russian a world-wide prestige, especially during the middle third of the twentieth century.
Web search engine
A Web search engine is a search engine designed to search for information on the World Wide Web.
Information may consist of web pages, images and other types of files.
Some search engines also mine data available in newsgroups, databases, or open directories.
Unlike Web directories, which are maintained by human editors, search engines operate algorithmically or are a mixture of algorithmic and human input.
History
Before there were search engines there was a complete list of all webservers.
The list was edited by Tim Berners-Lee and hosted on the CERN webserver.
One historical snapshot from 1992 remains.
As more and more webservers went online the central list could not keep up.
On the NCSA site, new servers were announced under the title "What's New!", but no complete listing existed any more.
The very first tool used for searching on the (pre-web) Internet was Archie.
The name stands for "archive" without the "v".
It was created in 1990 by Alan Emtage, a student at McGill University in Montreal.
The program downloaded the directory listings of all the files located on public anonymous FTP (File Transfer Protocol) sites, creating a searchable database of file names; however, Archie did not index the contents of these sites.
The rise of Gopher (created in 1991 by Mark McCahill at the University of Minnesota) led to two new search programs, Veronica and Jughead.
Like Archie, they searched the file names and titles stored in Gopher index systems.
Veronica (Very Easy Rodent-Oriented Net-wide Index to Computerized Archives) provided a keyword search of most Gopher menu titles in the entire Gopher listings.
Jughead (Jonzy's Universal Gopher Hierarchy Excavation And Display) was a tool for obtaining menu information from specific Gopher servers.
While the name of the search engine "Archie" was not a reference to the Archie comic book series, "Veronica" and "Jughead" are characters in the series, thus referencing their predecessor.
The first Web search engine was Wandex, a now-defunct index collected by the World Wide Web Wanderer, a web crawler developed by Matthew Gray at MIT in 1993.
Another very early search engine, Aliweb, also appeared in 1993.
JumpStation (released in early 1994) used a crawler to find web pages for searching, but search was limited to the title of web pages only.
One of the first "full text" crawler-based search engines was WebCrawler, which came out in 1994.
Unlike its predecessors, it let users search for any word in any webpage, which has since become the standard for all major search engines.
It was also the first one to be widely known by the public.
Also in 1994 Lycos (which started at Carnegie Mellon University) was launched, and became a major commercial endeavor.
Soon after, many search engines appeared and vied for popularity.
These included Magellan, Excite, Infoseek, Inktomi, Northern Light, and AltaVista.
Yahoo! was among the most popular ways for people to find web pages of interest, but its search function operated on its web directory, rather than full-text copies of web pages.
Information seekers could also browse the directory instead of doing a keyword-based search.
In 1996, Netscape was looking to give a single search engine an exclusive deal as its featured search engine.
There was so much interest that Netscape instead struck a deal with five of the major search engines: for $5 million per year, each would appear in rotation on the Netscape search engine page.
These five engines were: Yahoo!, Magellan, Lycos, Infoseek and Excite.
Search engines were also known as some of the brightest stars in the Internet investing frenzy that occurred in the late 1990s.
Several companies entered the market spectacularly, receiving record gains during their initial public offerings.
Some have taken down their public search engine, and are marketing enterprise-only editions, such as Northern Light.
Many search engine companies were caught up in the dot-com bubble, a speculation-driven market boom that peaked in 1999 and ended in 2001.
Around 2000, the Google search engine rose to prominence.
The company achieved better results for many searches with an innovation called PageRank.
This iterative algorithm ranks web pages based on the number and PageRank of other web sites and pages that link there, on the premise that good or desirable pages are linked to more than others.
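The iterative idea can be sketched in a few lines. This is a simplified illustration and not Google's implementation: the damping factor of 0.85, the fixed iteration count, the even redistribution of rank from pages without outgoing links, and the four-page example graph are all assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate ranks for a link graph given as {page: [pages it links to]}."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    # Use sets so that duplicate links from one page count once.
    graph = {page: set(links.get(page, ())) for page in pages}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Rank held by pages with no outgoing links is shared among all pages.
        dangling = sum(rank[p] for p in pages if not graph[p])
        new_rank = {}
        for page in pages:
            # Each linking page passes on an equal share of its own rank.
            incoming = sum(
                rank[src] / len(targets)
                for src, targets in graph.items()
                if page in targets
            )
            new_rank[page] = (1 - damping) / n + damping * (incoming + dangling / n)
        rank = new_rank
    return rank


example_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(example_links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this example graph, page "c" receives links from three other pages and ends up ranked first, while "d", which nothing links to, is ranked last.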
Google also maintained a minimalist interface to its search engine.
In contrast, many of its competitors embedded a search engine in a web portal.
By 2000, Yahoo! was providing search services based on Inktomi's search engine.
Yahoo! acquired Inktomi in 2002, and Overture (which owned AlltheWeb and AltaVista) in 2003.
Yahoo! used Google's search engine until 2004, when it launched its own search engine based on the combined technologies of its acquisitions.
Microsoft first launched MSN Search (since re-branded Live Search) in the fall of 1998 using search results from Inktomi.
In early 1999, the site began to display listings from Looksmart blended with results from Inktomi, except for a short time in 1999 when results from AltaVista were used instead.
In 2004, Microsoft began a transition to its own search technology, powered by its own web crawler (called msnbot).
As of late 2007, Google was by far the most popular Web search engine worldwide.
A number of country-specific search engine companies have become prominent; for example Baidu is the most popular search engine in the People's Republic of China and guruji.com in India.
How Web search engines work
A search engine operates in the following order:
Web crawling
Indexing
Searching
Web search engines work by storing information about many web pages, which they retrieve from the WWW itself.
These pages are retrieved by a Web crawler (sometimes also known as a spider) — an automated Web browser which follows every link it sees.
Exclusions can be made by the use of robots.txt.
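How a well-behaved crawler might honor such exclusions can be sketched with Python's standard urllib.robotparser module. The robots.txt content, the example.com URLs and the agent name ExampleBot are all made up for the illustration.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt that keeps all crawlers out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url: str, agent: str = "ExampleBot") -> bool:
    """Return True if the rules permit this agent to fetch the URL."""
    return parser.can_fetch(agent, url)


print(allowed("http://example.com/index.html"))      # True
print(allowed("http://example.com/private/a.html"))  # False
```

A real crawler would fetch the file from the site itself before following links there; the rules are parsed from a string here only to keep the sketch self-contained.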
The contents of each page are then analyzed to determine how it should be indexed (for example, words are extracted from the titles, headings, or special fields called meta tags).
Data about web pages are stored in an index database for use in later queries.
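One common way to organize such an index is an inverted index, which maps each word to the pages that contain it. The sketch below is a minimal in-memory version under that assumption; the page texts and URLs are invented, and real engines store far more (positions, fields such as titles, and ranking signals).

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"\w+", text.lower())


def build_index(pages):
    """Build a mapping from each word to the set of URLs containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


# Hypothetical crawled pages.
pages = {
    "http://example.com/a": "Search engines crawl the Web",
    "http://example.com/b": "A crawler follows every link",
    "http://example.com/c": "Engines index every page they crawl",
}

index = build_index(pages)
print(sorted(index["crawl"]))
print(sorted(index["every"]))
```

Looking up a word is then a single dictionary access, regardless of how many pages were indexed.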
Some search engines, such as Google, store all or part of the source page (referred to as a cache) as well as information about the web pages, whereas others, such as AltaVista, store every word of every page they find.
This cached page always holds the actual search text since it is the one that was actually indexed, so it can be very useful when the content of the current page has been updated and the search terms are no longer in it.
This problem might be considered a mild form of linkrot, and Google's handling of it increases usability by satisfying the user's expectation, in keeping with the principle of least astonishment, that the search terms will be on the returned webpage.
Increased search relevance makes these cached pages very useful, even beyond the fact that they may contain data that may no longer be available elsewhere.
When a user enters a query into a search engine (typically by using key words), the engine examines its index and provides a listing of best-matching web pages according to its criteria, usually with a short summary containing the document's title and sometimes parts of the text.
Most search engines support the use of the Boolean operators AND, OR and NOT to further specify the search query.
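Over an index that maps words to sets of documents, these operators correspond naturally to set operations. The sketch below assumes such an index; the document names and the helper functions op_and, op_or and op_not are illustrative and do not describe any particular engine.

```python
# A hypothetical index: each word maps to the documents that contain it.
INDEX = {
    "russian": {"doc1", "doc2"},
    "language": {"doc1", "doc3"},
    "search": {"doc3", "doc4"},
}
ALL_DOCS = set().union(*INDEX.values())


def docs(term):
    """Documents containing the term (empty set if the term is unknown)."""
    return INDEX.get(term.lower(), set())


def op_and(left, right):
    return left & right        # documents in both sets


def op_or(left, right):
    return left | right        # documents in either set


def op_not(operand):
    return ALL_DOCS - operand  # documents not in the set


# russian AND language
print(sorted(op_and(docs("russian"), docs("language"))))
# language AND NOT search
print(sorted(op_and(docs("language"), op_not(docs("search")))))
# russian OR search
print(sorted(op_or(docs("russian"), docs("search"))))
```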
Some search engines provide an advanced feature called proximity search which allows users to define the distance between keywords.
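A proximity test can be sketched by comparing word positions. The function below assumes distance is measured in words and that order does not matter; engines differ on both points, and the name near is only illustrative.

```python
import re


def positions(words, term):
    """Indexes at which the term occurs in a list of words."""
    return [i for i, word in enumerate(words) if word == term]


def near(text, first, second, max_distance):
    """True if the two terms occur within max_distance words of each other."""
    words = re.findall(r"\w+", text.lower())
    return any(
        abs(i - j) <= max_distance
        for i in positions(words, first.lower())
        for j in positions(words, second.lower())
    )


sentence = "the quick brown fox jumps over the lazy dog"
print(near(sentence, "quick", "fox", 2))  # True: two words apart
print(near(sentence, "quick", "dog", 3))  # False: seven words apart
```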
The usefulness of a search engine depends on the relevance of the result set it gives back.
While there may be millions of webpages that include a particular word or phrase, some pages may be more relevant, popular, or authoritative than others.
Most search engines employ methods to rank the results to provide the "best" results first.
How a search engine decides which pages are the best matches, and what order the results should be shown in, varies widely from one engine to another.
The methods also change over time as Internet usage changes and new techniques evolve.
Most Web search engines are commercial ventures supported by advertising revenue and, as a result, some employ the controversial practice of allowing advertisers to pay money to have their listings ranked higher in search results.
Those search engines which do not accept money for their search engine results make money by running search-related ads alongside the regular search engine results.
The search engines make money every time someone clicks on one of these ads.
The vast majority of search engines are run by private companies using proprietary algorithms and closed databases, though some are open source.
Revenue in the web search portals industry is projected to grow in 2008 by 13.4 percent, with broadband connections expected to rise by 15.1 percent.
Between 2008 and 2012, industry revenue is projected to rise by 56 percent as Internet penetration still has some way to go to reach full saturation in American households.
Furthermore, broadband services are projected to account for an ever increasing share of domestic Internet users, rising to 118.7 million by 2012, with an increasing share accounted for by fiber-optic and high speed cable lines.
Semantics
Semantics is the study of meaning in communication.
The word derives from Greek σημαντικός (semantikos), "significant", from σημαίνω (semaino), "to signify, to indicate" and that from σήμα (sema), "sign, mark, token".
In linguistics it is the study of interpretation of signs as used by agents or communities within particular circumstances and contexts.
It has related meanings in several other fields.
Semanticists differ on what constitutes meaning in an expression.
For example, in the sentence, "John loves a bagel", the word bagel may refer to the object itself, which is its literal meaning or denotation, but it may also refer to many other figurative associations, such as how it meets John's hunger, etc., which may be its connotation.
Traditionally, the formal semantic view restricts semantics to its literal meaning, and relegates all figurative associations to pragmatics, but this distinction is increasingly difficult to defend.
The degree to which a theorist subscribes to the literal-figurative distinction decreases as one moves from the formal semantic, semiotic, pragmatic, to the cognitive semantic traditions.
The word semantic in its modern sense is considered to have first appeared in French as sémantique in Michel Bréal's 1897 book, Essai de sémantique.
In International Scientific Vocabulary semantics is also called semasiology.
The discipline of Semantics is distinct from Alfred Korzybski's General Semantics, which is a system for looking at non-immediate, or abstract meanings.
Linguistics
In linguistics, semantics is the subfield that is devoted to the study of meaning, as inherent at the levels of words, phrases, sentences, and even larger units of discourse (referred to as texts).
The basic area of study is the meaning of signs, and the study of relations between different linguistic units: homonymy, synonymy, antonymy, polysemy, paronyms, hypernymy, hyponymy, meronymy, metonymy, holonymy, exocentricity / endocentricity, linguistic compounds.
A key concern is how meaning attaches to larger chunks of text, possibly as a result of the composition from smaller units of meaning.
Traditionally, semantics has included the study of connotative sense and denotative reference, truth conditions, argument structure, thematic roles, discourse analysis, and the linkage of all of these to syntax.
Formal semanticists are concerned with the modeling of meaning in terms of the semantics of logic.
Thus the sentence John loves a bagel above can be broken down into its constituents (signs), of which the unit loves may serve as both syntactic and semantic head.
In the late 1960s, Richard Montague proposed a system for defining semantic entries in the lexicon in terms of lambda calculus.
Thus, the syntactic parse of the sentence above would now indicate loves as the head, and its entry in the lexicon would point to the arguments as the agent, John, and the object, bagel, with a special role for the article "a" (which Montague called a quantifier).
This resulted in the sentence being associated with the logical predicate loves (John, bagel), thus linking semantics to categorial grammar models of syntax.
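The composition described above can be imitated with ordinary lambda expressions. The following is a simplified sketch, not Montague's actual system: the finite domain, the facts and all names are hypothetical, and the article "a" is modelled as an existential quantifier over that domain.

```python
# Hypothetical toy model of the world (all names are illustrative assumptions).
DOMAIN = {"John", "Mary", "bagel1", "bagel2"}
BAGELS = {"bagel1", "bagel2"}
LOVES = {("John", "bagel1")}          # (agent, object) pairs that hold

# Lexical entries written as curried lambda terms.
is_bagel = lambda x: x in BAGELS
loves = lambda obj: lambda agent: (agent, obj) in LOVES

# "a" as a quantifier: takes a noun meaning P and a predicate Q and
# is true when some x in the domain satisfies both P(x) and Q(x).
a = lambda noun: lambda pred: any(noun(x) and pred(x) for x in DOMAIN)

def loves_a(agent, noun):
    """Meaning of '<agent> loves a <noun>' built by function application."""
    return a(noun)(lambda x: loves(x)(agent))
```

Here `loves_a("John", is_bagel)` evaluates the sentence against the toy model, so the truth value of the whole is computed from the meanings of its parts.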
The logical predicate thus obtained would be elaborated further, e.g. using truth theory models, which ultimately relate meanings to a set of Tarskian universals, which may lie outside the logic.
The notion of such meaning atoms or primitives is basic to the language of thought hypothesis of the 1970s.
Despite its elegance, Montague grammar was limited by the context-dependent variability in word sense, and led to several attempts at incorporating context, such as:
situation semantics ('80s): truth-values are incomplete and get assigned based on context
generative lexicon ('90s): categories (types) are incomplete and get assigned based on context
The dynamic turn in semantics
In the Chomskian tradition in linguistics there was no mechanism for the learning of semantic relations, and the nativist view considered all semantic notions as inborn.
Thus, even novel concepts were proposed to have been dormant in some sense.
This traditional view was also unable to address many issues such as metaphor or associative meanings, and semantic change, where meanings within a linguistic community change over time, and qualia or subjective experience.
Another issue not addressed by the nativist model was how perceptual cues are combined in thought, e.g. in mental rotation.
This traditional view of semantics, as an innate finite meaning inherent in a lexical unit that can be composed to generate meanings for larger chunks of discourse, is now being fiercely debated in the emerging domain of cognitive linguistics and also in the non-Fodorian camp in Philosophy of Language.
The challenge is motivated by
factors internal to language, such as the problem of resolving indexicals or anaphora (e.g. this x, him, last week).
In these situations "context" serves as the input, but the interpreted utterance also modifies the context, so it is also the output.
Thus, the interpretation is necessarily dynamic and the meaning of sentences is viewed as context-change potentials instead of propositions.
factors external to language, i.e. language is not a set of labels stuck on things, but "a toolbox, the importance of whose elements lie in the way they function rather than their attachments to things."
This view reflects the position of the later Wittgenstein and his famous game example, and is related to the positions of Quine, Davidson, and others.
A concrete example of the latter phenomenon is semantic underspecification — meanings are not complete without some elements of context.
To take an example of a single word, "red", its meaning in a phrase such as red book is similar to many other usages, and can be viewed as compositional.
However, the colours implied in phrases such as "red wine" (very dark), and "red hair" (coppery), or "red soil", or "red skin" are very different.
Indeed, these colours by themselves would not be called "red" by native speakers.
These instances are contrastive, so "red wine" is so called only in comparison with the other kind of wine (which also is not "white" for the same reasons).
This view goes back to de Saussure:
Each of a set of synonyms like redouter ('to dread'), craindre ('to fear'), avoir peur ('to be afraid') has its particular value only because they stand in contrast with one another.
No word has a value that can be identified independently of what else is in its vicinity.
and may go back to earlier Indian views on language, especially the Nyaya view of words as indicators and not carriers of meaning.
An attempt to defend a system based on propositional meaning for semantic underspecification can be found in the Generative Lexicon model of James Pustejovsky, who extends contextual operations (based on type shifting) into the lexicon.
Thus meanings are generated on the fly based on finite context.
Prototype theory
Another set of concepts related to fuzziness in semantics is based on prototypes.
The work of Eleanor Rosch and George Lakoff in the 1970s led to a view that natural categories are not characterizable in terms of necessary and sufficient conditions, but are graded (fuzzy at their boundaries) and inconsistent as to the status of their constituent members.
Systems of categories are not objectively "out there" in the world but are rooted in people's experience.
These categories evolve as learned concepts of the world — meaning is not an objective truth, but a subjective construct, learned from experience, and language arises out of the "grounding of our conceptual systems in shared embodiment and bodily experience".
A corollary of this is that the conceptual categories (i.e. the lexicon) will not be identical for different cultures, or indeed, for every individual in the same culture.
This leads to another debate (see the Whorf-Sapir hypothesis or Eskimo words for snow).
Computer science
In computer science, where it is considered as an application of mathematical logic, semantics reflects the meaning of programs or functions.
In this regard, semantics permits programs to be separated into their syntactical part (grammatical structure) and their semantic part (meaning).
For instance, the following statements use different syntaxes (languages), but have the same semantics:
x += y; (C, Java, etc.)
x := x + y; (Pascal)
Let x = x + y; (early BASIC)
x = x + y (most BASIC dialects, Fortran)
Generally these operations would all perform an arithmetical addition of 'y' to 'x' and store the result in a variable 'x'.
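The separation of syntax from meaning can be sketched in code: the four statements are parsed by different patterns but map to one and the same abstract operation. The regular expressions and the tuple representation `("add_into", target, operand)` are assumptions made for this illustration, not part of any real compiler.

```python
import re

# Syntax: one pattern for the compound form, one for the "x = x + y" family.
COMPOUND = re.compile(r"^(\w+)\s*\+=\s*(\w+);?$")                      # C, Java
PLAIN = re.compile(r"^(?:Let\s+)?(\w+)\s*:?=\s*(\w+)\s*\+\s*(\w+);?$")  # Pascal, BASIC, Fortran

def meaning(statement):
    """Return the abstract operation denoted by a statement, ignoring its syntax."""
    text = statement.strip()
    match = COMPOUND.match(text)
    if match:
        return ("add_into", match.group(1), match.group(2))
    match = PLAIN.match(text)
    # Only "v = v + w" counts: the target must reappear on the right-hand side.
    if match and match.group(1) == match.group(2):
        return ("add_into", match.group(1), match.group(3))
    raise ValueError(f"unrecognised statement: {statement!r}")

STATEMENTS = ["x += y;", "x := x + y;", "Let x = x + y;", "x = x + y"]
MEANINGS = {meaning(s) for s in STATEMENTS}
```

Because every statement yields the same tuple, `MEANINGS` ends up holding a single element.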
Semantics for computer applications falls into three categories:
Operational semantics: The meaning of a construct is specified by the computation it induces when it is executed on a machine.
In particular, it is of interest how the effect of a computation is produced.
Denotational semantics: Meanings are modelled by mathematical objects that represent the effect of executing the constructs.
Thus only the effect is of interest, not how it is obtained.
Axiomatic semantics: Specific properties of the effect of executing the constructs are expressed as assertions.
Thus there may be aspects of the executions that are ignored.
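The three categories can be contrasted on the single statement x := x + y. The following is an informal sketch only: program states are modelled as dictionaries, and the function names are hypothetical.

```python
# Operational view: describe HOW the result is produced, step by step.
def run_operational(state):
    trace = [dict(state)]                 # record the initial machine state
    tmp = state["x"] + state["y"]         # step 1: evaluate the right-hand side
    new_state = {**state, "x": tmp}       # step 2: store the result in x
    trace.append(new_state)
    return trace

# Denotational view: the statement denotes a function from states to states;
# only the effect matters, not the intermediate steps.
def denote_add_assign(state):
    return {**state, "x": state["x"] + state["y"]}

# Axiomatic view: check an assertion about the effect, in the style of
# {x == a and y == b}  x := x + y  {x == a + b and y == b}.
def check_triple(a, b):
    before = {"x": a, "y": b}
    after = denote_add_assign(before)
    return after["x"] == a + b and after["y"] == b
```

The operational function exposes the sequence of states, the denotational function exposes only the input-output mapping, and the axiomatic check states a property of the effect while ignoring everything else.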
The Semantic Web refers to the extension of the World Wide Web through the embedding of additional semantic metadata; see also Web Ontology Language (OWL).
Psychology
In psychology, semantic memory is memory for meaning, in other words, the aspect of memory that preserves only the gist, the general significance, of remembered experience, while episodic memory is memory for the ephemeral details, the individual features, or the unique particulars of experience.
The meaning of words is measured by the company they keep, that is, by the relationships among the words themselves in a semantic network.
In a network created by people analyzing their understanding of words (such as WordNet), the links and decomposition structures of the network are few in number and kind, and include "part of", "kind of", and similar links.
In automated ontologies the links are computed vectors without explicit meaning.
Various automated technologies are being developed to compute the meaning of words: latent semantic indexing and support vector machines as well as natural language processing, neural networks and predicate calculus techniques.
Semantics has been reported to drive the course of psychotherapeutic interventions.
Language structure can determine the treatment approach to drug-abusing patients.
While working in Europe for the US Information Agency, the American psychiatrist Dr. A. James Giannini reported semantic differences in medical approaches to addiction treatment.
English speaking countries used the term "drug dependence" to describe a rather passive pathology in their patients.
As a result the physician's role was more active.
Southern European countries such as Italy and Yugoslavia utilized the concept of "tossicomania" (i.e. toxic mania) to describe a more active rather than passive role of the addict.
As a result the treating physician's role shifted to that of a more passive guide than that of an active interventionist.
Sentence (linguistics)
In linguistics, a sentence is a grammatical unit of one or more words, bearing minimal syntactic relation to the words that precede or follow it, often preceded and followed in speech by pauses, having one of a small number of characteristic intonation patterns, and typically expressing an independent statement, question, request, command, etc.
Sentences are generally characterized in most languages by the presence of a finite verb, e.g. "The quick brown fox jumps over the lazy dog".
Components of a sentence
A simple complete sentence consists of a subject and a predicate.
The subject is typically a noun phrase, though other kinds of phrases (such as gerund phrases) work as well, and some languages allow subjects to be omitted.
The predicate is a finite verb phrase: it's a finite verb together with zero or more objects, zero or more complements, and zero or more adverbials.
See also copula for the consequences of this verb on the theory of sentence structure.
Clauses
A clause consists of a subject and a verb.
There are two types of clauses: independent and subordinate (dependent).
An independent clause consists of a subject and a verb and also demonstrates a complete thought: for example, "I am sad."
A subordinate clause consists of a subject and a verb, but demonstrates an incomplete thought: for example, "Because I had to move."
Classification
By structure
One traditional scheme for classifying English sentences is by the number and types of finite clauses:
A simple sentence consists of a single independent clause with no dependent clauses.
A compound sentence consists of multiple independent clauses with no dependent clauses.
These clauses are joined together using conjunctions, punctuation, or both.
A complex sentence consists of one or more independent clauses with at least one dependent clause.
A complex-compound sentence (or compound-complex sentence) consists of multiple independent clauses, at least one of which has at least one dependent clause.
By purpose
Sentences can also be classified based on their purpose:
A declarative sentence or declaration, the most common type, commonly makes a statement: I am going home.
A negative sentence or negation denies that a statement is true: I am not going home.
An interrogative sentence or question is commonly used to request information — When are you going to work? — but sometimes not; see rhetorical question.
An exclamatory sentence or exclamation is generally a more emphatic form of statement: What a wonderful day this is!
Major and minor sentences
A major sentence is a regular sentence; it has a subject and a predicate.
For example: I have a ball.
In this sentence one can change the persons: We have a ball.
However, a minor sentence is an irregular type of sentence.
It does not contain a finite verb.
For example, "Mary!"
"Yes."
"Coffee." etc.
Other examples of minor sentences are headings (e.g. the heading of this entry), stereotyped expressions (Hello!), emotional expressions (Wow!), proverbs, etc.
This can also include sentences which do not contain verbs (e.g. The more, the merrier.) in order to intensify the meaning around the nouns (normally found in poetry and catchphrases).
Computer software
Computer software, or just software, is a general term used to describe a collection of computer programs, procedures and documentation that perform some tasks on a computer system.
The term includes application software such as word processors which perform productive tasks for users, system software such as operating systems, which interface with hardware to provide the necessary services for application software, and middleware which controls and co-ordinates distributed systems.
"Software" is sometimes used in a broader context to mean anything which is not hardware but which is used with hardware, such as film, tapes and records.
Relationship to computer hardware
Computer software is so called to distinguish it from computer hardware, which encompasses the physical interconnections and devices required to store and execute (or run) the software.
At the lowest level, software consists of a machine language specific to an individual processor.
A machine language consists of groups of binary values signifying processor instructions which change the state of the computer from its preceding state.
Software is an ordered sequence of instructions for changing the state of the computer hardware in a particular sequence.
It is usually written in high-level programming languages that are easier and more efficient for humans to use (closer to natural language) than machine language.
High-level languages are compiled or interpreted into machine language object code.
Software may also be written in an assembly language, essentially, a mnemonic representation of a machine language using a natural language alphabet.
Assembly language must be assembled into object code via an assembler.
The term "software" was first used in this sense by John W. Tukey in 1958.
In computer science and software engineering, computer software is all computer programs.
The theory that is the basis for most modern software was first proposed by Alan Turing in his 1936 paper On Computable Numbers, with an Application to the Entscheidungsproblem.
Types
Practical computer systems divide software systems into three major classes: system software, programming software and application software, although the distinction is arbitrary, and often blurred.
System software helps run the computer hardware and computer system.
It includes operating systems, device drivers, diagnostic tools, servers, windowing systems, utilities and more.
The purpose of systems software is to insulate the applications programmer as much as possible from the details of the particular computer complex being used, especially memory and other hardware features, and such accessory devices as communications, printers, readers, displays, keyboards, etc.
Programming software usually provides tools to assist a programmer in writing computer programs, and software using different programming languages in a more convenient way.
The tools include text editors, compilers, interpreters, linkers, debuggers, and so on.
An integrated development environment (IDE) merges those tools into a software bundle, and a programmer may not need to type multiple commands for compiling, interpreting, debugging, tracing, etc., because the IDE usually has an advanced graphical user interface, or GUI.
Application software allows end users to accomplish one or more specific (non-computer related) tasks.
Typical applications include industrial automation, business software, educational software, medical software, databases, and computer games.
Businesses are probably the biggest users of application software, but almost every field of human activity now uses some form of application software.
Program and library
A program may not be sufficiently complete for execution by a computer.
In particular, it may require additional software from a software library in order to be complete.
Such a library may include software components used by stand-alone programs, but which cannot work on their own.
Thus, programs may include standard routines that are common to many programs, extracted from these libraries.
Libraries may also include 'stand-alone' programs which are activated by some computer event and/or perform some function (e.g., of computer 'housekeeping') but do not return data to their calling program.
Libraries may be called by one to many other programs; programs may call zero to many other programs.
Three layers
Users often see things differently than programmers.
People who use modern general purpose computers (as opposed to embedded systems, analog computers, supercomputers, etc.) usually see three layers of software performing a variety of tasks: platform, application, and user software.
Platform software:
Platform includes the firmware, device drivers, an operating system, and typically a graphical user interface which, in total, allow a user to interact with the computer and its peripherals (associated equipment).
Platform software often comes bundled with the computer.
On a PC you will usually have the ability to change the platform software.
Application software:
Application software, or applications, is what most people think of when they think of software.
Typical examples include office suites and video games.
Application software is often purchased separately from computer hardware.
Sometimes applications are bundled with the computer, but that does not change the fact that they run as independent applications.
Applications are almost always independent programs from the operating system, though they are often tailored for specific platforms.
Most users think of compilers, databases, and other "system software" as applications.
User-written software:
End-user development tailors systems to meet users' specific needs.
User software includes spreadsheet templates, word processor macros, scientific simulations, and scripts for graphics and animations.
Even email filters are a kind of user software.
Users create this software themselves and often overlook how important it is.
Depending on how competently the user-written software has been integrated into purchased application packages, many users may not be aware of the distinction between the purchased packages, and what has been added by fellow co-workers.
Creation
Operation
Computer software has to be "loaded" into the computer's storage (such as a hard drive, memory, or RAM).
Once the software has loaded, the computer is able to execute the software.
This involves passing instructions from the application software, through the system software, to the hardware which ultimately receives the instruction as machine code.
Each instruction causes the computer to carry out an operation -- moving data, carrying out a computation, or altering the control flow of instructions.
Data movement is typically from one place in memory to another.
Sometimes it involves moving data between memory and registers which enable high-speed data access in the CPU.
Moving data, especially large amounts of it, can be costly.
So, this is sometimes avoided by using "pointers" to data instead.
Computations include simple operations such as incrementing the value of a variable data element.
More complex computations may involve many operations and data elements together.
Instructions may be performed sequentially, conditionally, or iteratively.
Sequential instructions are those operations that are performed one after another.
Conditional instructions are performed such that different sets of instructions execute depending on the value(s) of some data.
In some languages this is known as an "if" statement.
Iterative instructions are performed repetitively and may depend on some data value.
This is sometimes called a "loop."
Often, one instruction may "call" another set of instructions that are defined in some other program or module.
When more than one computer processor is used, instructions may be executed simultaneously.
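The kinds of instruction flow described above (sequential, conditional, iterative, and calling another set of instructions) can be seen together in a short sketch. The function names and the arithmetic are arbitrary choices made for illustration.

```python
def double(n):
    """A separate set of instructions that other code can "call"."""
    return n * 2

def process(values):
    total = 0                      # sequential: executed once, in order
    for v in values:               # iterative: a "loop" over the data
        if v % 2 == 0:             # conditional: an "if" on a data value
            total += double(v)     # call: jump into double() and come back
        else:
            total += v
    return total                   # sequential: runs after the loop ends
```

For example, `process([1, 2, 3, 4])` adds the odd values unchanged and the even values doubled.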
A simple example of the way software operates is what happens when a user selects an entry such as "Copy" from a menu.
In this case, a conditional instruction is executed to copy text from data in a 'document' area residing in memory, perhaps to an intermediate storage area known as a 'clipboard' data area.
If a different menu entry such as "Paste" is chosen, the software may execute the instructions to copy the text from the clipboard data area to a specific location in the same or another document in memory.
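A minimal sketch of this menu handling follows, assuming a hypothetical `Editor` class in which the document and the clipboard are plain strings held in memory. Real applications handle selections, formats and the system clipboard in much more involved ways.

```python
class Editor:
    """Hypothetical text editor with an in-memory document and clipboard."""

    def __init__(self, text):
        self.document = text       # the 'document' data area
        self.clipboard = ""        # the intermediate 'clipboard' data area

    def handle_menu(self, entry, start=0, end=0):
        # Conditional instructions: which branch runs depends on the menu entry.
        if entry == "Copy":
            # Copy the selected text from the document to the clipboard.
            self.clipboard = self.document[start:end]
        elif entry == "Paste":
            # Insert the clipboard text into the document at position start.
            self.document = (
                self.document[:start] + self.clipboard + self.document[start:]
            )
        else:
            raise ValueError(f"unknown menu entry: {entry!r}")
```

A usage example: after `editor = Editor("hello world")`, calling `editor.handle_menu("Copy", 0, 5)` places "hello" on the clipboard, and `editor.handle_menu("Paste", start=11)` appends it to the end of the document.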
Depending on the application, even the example above could become complicated.
The field of software engineering endeavors to manage the complexity of how software operates.
This is especially true for software that operates in the context of a large or powerful computer system.
Currently, almost the only limitation on the use of computer software in applications is the ingenuity of the designer/programmer.
Consequently, large areas of activities (such as playing grand master level chess) formerly assumed to be incapable of software simulation are now routinely programmed.
The only area that has so far proved reasonably secure from software simulation is the realm of human art— especially, pleasing music and literature.
Kinds of software by operation: computer program as executable, source code or script, configuration.
Quality and reliability
Software reliability considers the errors, faults, and failures related to the design, implementation and operation of software.
See Software auditing, Software quality, Software testing, and Software reliability.
License
A software license gives the user the right to use the software in the licensed environment. Some software comes with the license when purchased off the shelf, or with an OEM license when bundled with hardware.
Other software comes with a free software licence, granting the recipient the rights to modify and redistribute the software.
Software can also be in the form of freeware or shareware.
See also License Management.
Patents
The issue of software patents is controversial.
Some believe that they hinder software development, while others argue that software patents provide an important incentive to spur software innovation.
See software patent debate.
Ethics and rights for software users
Because software is a relatively new part of society, the idea of what rights its users should have is not very developed.
Some, such as the free software community, believe that software users should be free to modify and redistribute the software they use.
They argue that these rights are necessary so that each individual can control their computer, and so that everyone can cooperate, if they choose, to work together as a community and control the direction that software progresses in.
Others believe that software authors should have the power to say what rights the user will get.
Software companies and non-profit organizations
Examples of non-profit software organizations are: Free Software Foundation, GNU Project, Mozilla Foundation.
Examples of large software companies are: Microsoft, IBM, Oracle, SAP and HP.
Spanish language
Spanish or Castilian (castellano) is an Indo-European, Romance language that originated in northern Spain, and gradually spread in the Kingdom of Castile and evolved into the principal language of government and trade.
It was taken to Africa, the Americas, and Asia Pacific with the expansion of the Spanish Empire between the fifteenth and nineteenth centuries.
Today, between 322 and 400 million people speak Spanish as a native language, making it the world's second most-spoken language by native speakers (after Mandarin Chinese).
Hispanosphere
It is estimated that the combined total of native and non-native Spanish speakers is approximately 500 million, likely making it the third most spoken language by total number of speakers (after English and Chinese).
Today, Spanish is an official language of Spain, most Latin American countries, and Equatorial Guinea; 21 nations speak it as their primary language.
Spanish also is one of six official languages of the United Nations.
Mexico has the world's largest Spanish-speaking population, and Spanish is the second most widely spoken language in the United States and the most commonly studied foreign language in U.S. schools and universities.
Global internet usage statistics for 2007 show Spanish as the third most commonly used language on the Internet, after English and Chinese.
Naming and origin
Spaniards tend to call this language español (Spanish) when contrasting it with languages of other states, such as French and English, but call it castellano (Castilian), that is, the language of the Castile region, when contrasting it with other languages spoken in Spain such as Galician, Basque, and Catalan.
This reasoning also holds true for the language's preferred name in some Hispanic American countries.
In this manner, the Spanish Constitution of 1978 uses the term castellano to define the official language of the whole Spanish State, as opposed to las demás lenguas españolas (lit. the other Spanish languages).
Article III reads as follows:
The name castellano is, however, widely used for the language as a whole in Latin America.
Some Spanish speakers consider castellano a generic term with no political or ideological links, much as "Spanish" is in English.
Often Latin Americans use it to differentiate their own variety of Spanish from the variety spoken in Spain, or from the variety considered standard in the region.
Classification and related languages
Spanish is closely related to the other West Iberian Romance languages: Asturian, Galician, Ladino, and Portuguese.
Catalan, an East Iberian language which exhibits many Gallo-Romance traits, is more similar to the neighbouring Occitan language than to Spanish, or indeed than Spanish and Portuguese are to each other.
Spanish and Portuguese share similar grammars and vocabulary as well as a common history of Arabic influence while a great part of the peninsula was under Islamic rule (both languages expanded over Islamic territories).
Their lexical similarity has been estimated as 89%.
See Differences between Spanish and Portuguese for further information.
Ladino
Ladino, which is essentially medieval Spanish and closer to modern Spanish than any other language, is spoken by many descendants of the Sephardi Jews who were expelled from Spain in the 15th century.
Ladino speakers are currently almost exclusively Sephardi Jews, with family roots in Turkey, Greece or the Balkans: current speakers mostly live in Israel and Turkey, with a few pockets in Latin America.
It lacks the Native American vocabulary which was influential during the Spanish colonial period, and it retains many archaic features which have since been lost in standard Spanish.
It contains, however, other vocabulary which is not found in standard Castilian, including vocabulary from Hebrew, some French, Greek and Turkish, and other languages spoken where the Sephardim settled.
Ladino is in serious danger of extinction because many native speakers today are elderly, among them elderly olim (immigrants to Israel), and have not transmitted the language to their children or grandchildren.
However, it is experiencing a minor revival among Sephardi communities, especially in music.
In the case of the Latin American communities, the danger of extinction is also due to the risk of assimilation by modern Castilian.
A related dialect is Haketia, the Judaeo-Spanish of northern Morocco.
This too tended to assimilate with modern Spanish, during the Spanish occupation of the region.
Vocabulary comparison
Spanish and Italian share a very similar phonological system.
At present, the lexical similarity with Italian is estimated at 82%.
As a result, Spanish and Italian are mutually intelligible to various degrees.
The lexical similarity with Portuguese is greater, 89%, but the vagaries of Portuguese pronunciation make it less easily understood by Hispanophones than Italian.
Mutual intelligibility between Spanish and French or Romanian is even lower (lexical similarity being respectively 75% and 71%): comprehension of Spanish by French speakers who have not studied the language is as low as an estimated 45% - the same as of English.
The common features of the writing systems of the Romance languages allow for a greater amount of interlingual reading comprehension than oral communication would.
History
Spanish evolved from Vulgar Latin, with major influences from Arabic in vocabulary during the Andalusian period and minor surviving influences from Basque and Celtiberian, as well as Germanic languages via the Visigoths.
Spanish developed along the remote cross road strips among the Alava, Cantabria, Burgos, Soria and La Rioja provinces of Northern Spain, as a strongly innovative and differing variant from its nearest cousin, Leonese speech, with a higher degree of Basque influence in these regions (see Iberian Romance languages).
Typical features of Spanish diachronical phonology include lenition (Latin , Spanish ), palatalization (Latin , Spanish , and Latin , Spanish ) and diphthongation (stem-changing) of short e and o from Vulgar Latin (Latin , Spanish ; Latin , Spanish ).
Similar phenomena can be found in other Romance languages as well.
During the Reconquista, this northern dialect from Cantabria was carried south, and remains a minority language in northern coastal Morocco.
The first Latin-to-Spanish grammar () was written in Salamanca, Spain, in 1492, by Elio Antonio de Nebrija.
When it was presented to Isabel de Castilla, she asked, "What do I want a work like this for, if I already know the language?", to which he replied, "Your highness, the language is the instrument of the Empire."
From the 16th century onwards, the language was taken to the Americas and the Spanish East Indies via Spanish colonization.
In the 20th century, Spanish was introduced to Equatorial Guinea and the Western Sahara, and to areas of the United States that had not been part of the Spanish Empire, such as Spanish Harlem in New York City.
For details on borrowed words and other external influences upon Spanish, see Influences on the Spanish language.
Characterization
A defining characteristic of Spanish was the diphthongization of the Latin short vowels e and o into ie and ue, respectively, when they were stressed.
Similar sound changes are found in other Romance languages, but in Spanish they were significant.
Some examples:
Lat. > Sp. , It. , Fr. , Rom. , Port./Gal. "stone".
Lat. > Sp. , It. , Fr. / , Rom. , Port./Gal. "die".
Peculiar to early Spanish (as in the Gascon dialect of Occitan, and possibly due to a Basque substratum) was the mutation of Latin initial f- into h- whenever it was followed by a vowel that did not diphthongize.
Compare for instance:
Lat. > It. , Port. , Gal. , Fr. , Occitan (but Gascon ) Sp. (but Ladino );
Lat. > Lad. , Port./Gal. , Sp. ;
but Lat. > It. , Port./Gal. , Sp./Lad. .
Some consonant clusters of Latin also produced characteristically different results in these languages, for example:
Lat. , acc. , > Lad. , , ; Sp. , , .
However, in Spanish there are also the forms , , ; Port. , , ; Gal. , , .
Lat. acc. , , > Lad. , , ; Sp. , , ; Port. , , ; Gal. , , .
Geographic distribution
Spanish is one of the official languages of the European Union, the Organization of American States, the Organization of Ibero-American States, the United Nations, and the Union of South American Nations.
Europe
Spanish is an official language of Spain, the country for which it is named and from which it originated.
It is also spoken in Gibraltar, though English is the official language.
Likewise, it is spoken in Andorra though Catalan is the official language.
It is also spoken by small communities in other European countries, such as the United Kingdom, France, and Germany.
Spanish is an official language of the European Union.
In Switzerland, Spanish is the mother tongue of 1.7% of the population, making it the largest minority language after the country's four official languages.
The Americas
Latin America
Most Spanish speakers are in Latin America; of the countries with the most Spanish speakers, only Spain is outside the Americas.
Mexico has more native speakers than any other country.
Nationally, Spanish is the official language of Argentina, Bolivia (co-official Quechua and Aymara), Chile, Colombia, Costa Rica, Cuba, Dominican Republic, Ecuador, El Salvador, Guatemala, Honduras, Mexico, Nicaragua, Panama, Paraguay (co-official Guaraní), Peru (co-official Quechua and, in some regions, Aymara), Uruguay, and Venezuela.
Spanish is also the official language (co-official with English) in the U.S. commonwealth of Puerto Rico.
Spanish has no official recognition in the former British colony of Belize; however, per the 2000 census, it is spoken by 43% of the population.
Mainly, it is spoken by Hispanic descendants who remained in the region since the 17th century; however, English is the official language.
Spain colonized Trinidad and Tobago first in 1498, leaving the Carib people the Spanish language.
The Cocoa Panyols, laborers from Venezuela, also took their culture and language with them; they are credited with the music of "Parang" ("Parranda") on the island.
Because of Trinidad's location on the South American coast, the country is much influenced by its Spanish-speaking neighbors.
A recent census shows that more than 1,500 inhabitants speak Spanish.
The government launched the Spanish as a First Foreign Language (SAFFL) initiative in March 2005.
Government regulations require Spanish to be taught, beginning in primary school, while thirty percent of public employees are to be linguistically competent within five years.
The government also announced that Spanish will be the country's second official language by 2020, beside English.
Spanish is important in Brazil because of its proximity to and increased trade with its Spanish-speaking neighbors; for example, as a member of the Mercosur trading bloc.
In 2005, the National Congress of Brazil approved a bill, signed into law by the President, making Spanish available as a foreign language in secondary schools.
In many border towns and villages (especially on the Uruguayan-Brazilian border), a mixed language known as Portuñol is spoken.
United States
In the 2006 census, 44.3 million people of the U.S. population were Hispanic or Latino by origin; 34 million people, 12.2 percent, of the population older than 5 years speak Spanish at home.
Spanish has a long history in the United States (many south-western states were part of Mexico and Spain), and it recently has been revitalized by much immigration from Latin America.
Spanish is the most widely taught foreign language in the country.
Although the United States has no formally designated "official languages," Spanish is formally recognized alongside English at the state level in New Mexico, where 30 percent of the population speak it.
It also has strong influence in metropolitan areas such as Los Angeles, Miami and New York City.
Spanish is the dominant spoken language in Puerto Rico, a U.S. territory.
In total, the U.S. has the world's fifth-largest Spanish-speaking population.
Asia
Spanish was an official language of the Philippines but was never spoken by a majority of the population.
Movements to teach the language to the masses were started but were stopped by the friars.
Its importance fell in the first half of the 20th century following the U.S. occupation and administration of the islands.
The introduction of the English language in the Philippine government system put an end to the use of Spanish as the official language.
The language lost its official status in 1973 during the Ferdinand Marcos administration.
Spanish is spoken mainly by small communities of Filipino-born Spaniards, Latin Americans, and Filipino mestizos (mixed race), descendants of the early colonial Spanish settlers.
Throughout the 20th century, the Spanish language declined in importance compared to English and Tagalog.
According to the 1990 Philippine census, there were 2,658 native speakers of Spanish.
No figures were provided during the 1995 and 2000 censuses; however, figures for 2000 did specify there were over 600,000 native speakers of Chavacano, a Spanish-based creole language spoken in Cavite and Zamboanga.
Some other sources put the number of Spanish speakers in the Philippines around two to three million; however, these sources are disputed.
Tagalog has 4,000 words adopted from Spanish, and Visayan and other Philippine languages have around 6,000.
Today Spanish is offered as a foreign language in Philippine schools and universities.
Africa
In Africa, Spanish is official in the UN-recognised but Moroccan-occupied Western Sahara (co-official Arabic) and Equatorial Guinea (co-official French and Portuguese).
Today, nearly 200,000 refugee Sahrawis are able to read and write in Spanish, and several thousands have received university education in foreign countries as part of aid packages (mainly Cuba and Spain).
In Equatorial Guinea, Spanish is the predominant language when counting native and non-native speakers (around 500,000 people), while Fang is the most spoken language by number of native speakers.
It is also spoken in the Spanish cities in continental North Africa (Ceuta and Melilla) and in the autonomous community of the Canary Islands (143,000 and 1,995,833 people, respectively).
Within Northern Morocco, a former Franco-Spanish protectorate that is also geographically close to Spain, approximately 20,000 people speak Spanish.
It is spoken by some communities of Angola, because of the Cuban influence from the Cold War, and in Nigeria by the descendants of Afro-Cuban ex-slaves.
In Côte d'Ivoire and Senegal, Spanish can be learned as a second foreign language in the public education system.
In 2008, Cervantes Institute centers will be opened in Lagos and Johannesburg, the first in Sub-Saharan Africa.
Oceania
Among the countries and territories in Oceania, Spanish is also spoken in Easter Island, a territorial possession of Chile.
According to the 2001 census, there are approximately 95,000 speakers of Spanish in Australia, 44,000 of whom live in Greater Sydney, where the older Mexican, Colombian, and Spanish populations and newer Argentine, Salvadoran and Uruguayan communities live.
The island nations of Guam, Palau, Northern Marianas, Marshall Islands and Federated States of Micronesia all once had Spanish speakers, since the Marianas and Caroline Islands were Spanish colonial possessions until the late 19th century (see Spanish-American War), but Spanish has since been forgotten.
It now exists only as an influence on the local native languages and is also spoken by Hispanic American resident populations.
Dialectal variation
There are important variations among the regions of Spain and throughout Spanish-speaking America.
In countries in Hispanophone America, it is preferable to use the word castellano to distinguish their version of the language from that of Spain, thus asserting their autonomy and national identity.
In Spain the Castilian dialect's pronunciation is commonly regarded as the national standard, although this dialect's use of slightly different pronouns, called loísmo, is deprecated.
More accurately, for nearly everyone in Spain, "standard Spanish" means "pronouncing everything exactly as it is written," an ideal which does not correspond to any real dialect, though the northern dialects are the closest to it.
In practice, the standard way of speaking Spanish in the media is "written Spanish" for formal speech and "Madrid dialect" (one of the transitional variants between Castilian and Andalusian) for informal speech.
Voseo
Spanish has three second-person singular pronouns: tú, usted, and in some parts of Latin America, vos (the use of this pronoun and/or its verb forms is called voseo).
In those regions where it is used, generally speaking, tú and vos are informal and used with friends; in other countries, vos is considered an archaic form.
Usted is universally regarded as the formal address (derived from vuestra merced, "your grace"), and is used as a mark of respect, as when addressing one's elders or strangers.
Vos is used extensively as the primary spoken form of the second-person singular pronoun, although with wide differences in social consideration, in many countries of Latin America, including Argentina, Chile, Costa Rica, the central mountain region of Ecuador, the State of Chiapas in Mexico, El Salvador, Guatemala, Honduras, Nicaragua, Paraguay, Uruguay, the Paisa region and Caleños of Colombia and the States of Zulia and Trujillo in Venezuela.
There are some differences in the verbal endings for vos in each country.
In Argentina, Uruguay, and increasingly in Paraguay and some Central American countries, it is also the standard form used in the media, but the media in other countries with voseo generally continue to use usted or tú except in advertisements, for instance.
Vos may also be used regionally in other countries.
Depending on country or region, usage may be considered standard or (by better educated speakers) to be unrefined.
Interpersonal situations in which the use of vos is acceptable may also differ considerably between regions.
Ustedes
Spanish forms also differ regarding second-person plural pronouns.
The Spanish dialects of Latin America have only one form of the second-person plural for daily use, ustedes (formal or familiar, as the case may be, though the non-formal vosotros can sometimes appear in poetry and rhetorical or literary style).
In Spain there are two forms: ustedes (formal) and vosotros (familiar).
The pronoun vosotros is the plural form of tú in most of Spain, but in the Americas (and certain southern Spanish cities such as Cádiz or Seville, and in the Canary Islands) it is replaced with ustedes.
It is notable that the use of ustedes for the informal plural "you" in southern Spain does not follow the usual rule for pronoun-verb agreement; e.g., while the formal form for "you go", ustedes van, uses the third-person plural form of the verb, in Cádiz or Seville the informal form is constructed as ustedes vais, using the second-person plural of the verb.
In the Canary Islands, though, the usual pronoun-verb agreement is preserved in most cases.
Some words can be different, even embarrassingly so, in different Hispanophone countries.
Most Spanish speakers can recognize other Spanish forms, even in places where they are not commonly used, but Spaniards generally do not recognise specifically American usages.
For example, Spanish mantequilla, aguacate and albaricoque (respectively, "butter", "avocado", "apricot") correspond to manteca, palta, and damasco, respectively, in Argentina, Chile and Uruguay.
The everyday Spanish words coger (to catch, get, or pick up), pisar (to step on) and concha (seashell) are considered extremely rude in parts of Latin America, where the meaning of coger and pisar is also "to have sex" and concha means "vulva".
The Puerto Rican word for "bobby pin" (pinche) is an obscenity in Mexico, and in Nicaragua simply means "stingy".
Other examples include taco, which means "swearword" in Spain but is known to the rest of the world as a Mexican dish.
Pija in many countries of Latin America is an obscene slang word for "penis", while in Spain the word also signifies "posh girl" or "snobby".
Coche, which means "car" in Spain, means "baby-stroller" for the vast majority of Spanish speakers and "pig" in Guatemala, while carro means "car" in some Latin American countries and "cart" in others, as well as in Spain.
The Real Academia Española (Royal Spanish Academy), together with the 21 other national ones (see Association of Spanish Language Academies), exercises a standardizing influence through its publication of dictionaries and widely respected grammar and style guides.
Due to this influence and for other sociohistorical reasons, a standardized form of the language (Standard Spanish) is widely acknowledged for use in literature, academic contexts and the media.
Writing system
Spanish is written using the Latin alphabet, with the addition of the character ñ (eñe, representing the phoneme /ɲ/, a letter distinct from n, although typographically composed of an n with a tilde) and the digraphs ch (che, representing the phoneme /tʃ/) and ll (elle, representing the phoneme /ʎ/).
However, the digraph rr (erre fuerte, "strong r", erre doble, "double r", or simply erre), which also represents a distinct phoneme /r/, is not similarly regarded as a single letter.
Since 1994, the digraphs ch and ll are to be treated as letter pairs for collation purposes, though they remain a part of the alphabet.
Words with ch are now alphabetically sorted between those with ce and ci, instead of following cz as they used to, and similarly for ll.
Thus, the Spanish alphabet has the following 29 letters:
a, b, c, ch, d, e, f, g, h, i, j, k, l, ll, m, n, ñ, o, p, q, r, s, t, u, v, w, x, y, z.
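The effect of the 1994 collation change can be illustrated with a short sketch. It is not an official algorithm: the function names, the accent folding, and the sample words are assumptions made for this example. The idea is to split each word into letters of the chosen alphabet (digraphs first) and compare the resulting ranks.

```python
import re

# Traditional ordering: ch and ll are single letters placed after c and l.
TRADITIONAL = ["a", "b", "c", "ch", "d", "e", "f", "g", "h", "i", "j", "k",
               "l", "ll", "m", "n", "ñ", "o", "p", "q", "r", "s", "t", "u",
               "v", "w", "x", "y", "z"]
# Post-1994 ordering: ch and ll are compared as ordinary letter pairs,
# so only the single letters take part in the comparison.
MODERN = [letter for letter in TRADITIONAL if len(letter) == 1]

# Simplification (assumption): accented vowels sort like their base vowels.
FOLD_ACCENTS = str.maketrans("áéíóúü", "aeiouu")


def _collation_key(word, alphabet):
    rank = {letter: i for i, letter in enumerate(alphabet)}
    folded = word.lower().translate(FOLD_ACCENTS)
    # Longest alternatives first, so "ch" is matched before "c".
    pattern = "|".join(sorted(map(re.escape, alphabet), key=len, reverse=True))
    return [rank[token] for token in re.findall(pattern, folded)]


def traditional_key(word):
    """Sort key for the older convention (ch after cz, ll after lz)."""
    return _collation_key(word, TRADITIONAL)


def modern_key(word):
    """Sort key for the current convention (ch between ce and ci)."""
    return _collation_key(word, MODERN)


words = ["cena", "chico", "cima", "czar", "dama"]
print(sorted(words, key=modern_key))
print(sorted(words, key=traditional_key))
```

Under the current convention chico falls between cena and cima, whereas under the older one it follows czar. Characters outside the alphabet, such as hyphens, are simply ignored by this sketch.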
With the exclusion of a very small number of regional terms such as México (see Toponymy of Mexico) and some neologisms like software, pronunciation can be entirely determined from spelling.
A typical Spanish word is stressed on the syllable before the last if it ends with a vowel (not including y) or with a vowel followed by n or s; it is stressed on the last syllable otherwise.
Exceptions to this rule are indicated by placing an acute accent on the stressed vowel.
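The stress rule just described can be written as a small function. This is a minimal sketch under two assumptions: the word is already divided into syllables (syllabification is not attempted here), and the function name and sample words are chosen only for illustration.

```python
ACCENTED_VOWELS = set("áéíóú")
PLAIN_VOWELS = set("aeiou")  # y is deliberately not counted as a vowel


def stressed_syllable(syllables):
    """Return the index of the stressed syllable of a pre-syllabified word."""
    # A written acute accent overrides the default rule.
    for index, syllable in enumerate(syllables):
        if any(ch in ACCENTED_VOWELS for ch in syllable.lower()):
            return index
    if len(syllables) == 1:
        return 0
    last = syllables[-1].lower()
    ends_in_vowel = last[-1] in PLAIN_VOWELS
    ends_in_vowel_plus_n_or_s = (
        len(last) >= 2 and last[-1] in "ns" and last[-2] in PLAIN_VOWELS
    )
    if ends_in_vowel or ends_in_vowel_plus_n_or_s:
        return len(syllables) - 2  # syllable before the last
    return len(syllables) - 1      # last syllable


print(stressed_syllable(["ca", "sa"]))             # vowel ending
print(stressed_syllable(["ciu", "dad"]))           # consonant ending
print(stressed_syllable(["dis", "tin", "ción"]))   # accent mark wins
```

The function covers only the default rule and the accent-mark exception; it does not handle words written entirely in unaccented capitals.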
The acute accent is used, in addition, to distinguish between certain homophones, especially when one of them is a stressed word and the other one is a clitic: compare el ("the", masculine singular definite article) with él ("he" or "it"), or te ("you", object pronoun), de (preposition "of" or "from"), and se (reflexive pronoun) with té ("tea"), dé ("give") and sé ("I know", or imperative "be").
The interrogative pronouns (qué, cuál, dónde, quién, etc.) also receive accents in direct or indirect questions, and some demonstratives (ése, éste, aquél, etc.) must be accented when used as pronouns.
The conjunction o ("or") is written with an accent between numerals so as not to be confused with a zero: e.g., 10 ó 20 should be read as diez o veinte rather than diez mil veinte ("10,020").
Accent marks are frequently omitted in capital letters (a widespread practice in the early days of computers, when only lowercase vowels were available with accents), although the RAE advises against this.
When u is written between g and a front vowel (e or i), if it should be pronounced, it is written with a diaeresis (ü) to indicate that it is not silent as it normally would be (e.g., in cigüeña, "stork", the u is pronounced; if the word were written cigueña, the u would be silent).
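The diaeresis rule lends itself to a simple pattern match. The sketch below is illustrative only: the function names are hypothetical, the text is assumed to use precomposed characters such as ü, and only the g + u + e/i context described above is covered.

```python
import re


def silent_u_positions(word):
    """Indices of u letters that are silent because they stand between
    g and a front vowel (e or i) without a diaeresis."""
    return [m.start() + 1 for m in re.finditer(r"gu(?=[eiéí])", word.lower())]


def sounded_u_positions(word):
    """Indices of ü letters, which are pronounced in the same position."""
    return [m.start() + 1 for m in re.finditer(r"gü(?=[eiéí])", word.lower())]


for sample in ["guerra", "cigüeña", "cigueña", "agua"]:
    print(sample, silent_u_positions(sample), sounded_u_positions(sample))
```

In agua neither list is populated, because the u is followed by a rather than a front vowel and is pronounced without needing a diaeresis.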
Interrogative and exclamatory clauses are introduced with inverted question ( ¿ ) and exclamation ( ¡ ) marks.
Sounds
The phonemic inventory listed in the following table includes phonemes that are preserved only in some dialects, other dialects having merged them (such as yeísmo); these are marked with an asterisk (*).
Sounds in parentheses are allophones.
By the 16th century, the consonant system of Spanish underwent the following important changes that differentiated it from neighboring Romance languages such as Portuguese and Catalan:
Initial /f/, when it had evolved into a vacillating /h/, was lost in most words (although this etymological h- is preserved in spelling and in some Andalusian dialects is still aspirated).
The bilabial approximant /β/ (which was written u or v) merged with the bilabial occlusive /b/ (written b).
There is no difference between the pronunciation of orthographic b and v in contemporary Spanish, excepting emphatic pronunciations that cannot be considered standard or natural.
The voiced alveolar fricative /z/, which existed as a separate phoneme in medieval Spanish, merged with its voiceless counterpart /s/.
The phoneme which resulted from this merger is currently spelled s.
The voiced postalveolar fricative /ʒ/ merged with its voiceless counterpart /ʃ/, which evolved into the modern velar sound /x/ by the 17th century, now written with j, or g before e, i.
Nevertheless, in most parts of Argentina and in Uruguay, y and ll have both evolved to /ʒ/ or /ʃ/.
The voiced alveolar affricate /dz/ merged with its voiceless counterpart /ts/, which then developed into the interdental /θ/, now written z, or c before e, i.
But in Andalusia, the Canary Islands and the Americas this sound merged with /s/ as well.
See Ceceo, for further information.
The consonant system of Medieval Spanish has been better preserved in Ladino and in Portuguese, neither of which underwent these shifts.
Lexical stress
Spanish is a syllable-timed language, so each syllable has the same duration regardless of stress.
Stress most often occurs on any of the last three syllables of a word, with some rare exceptions at the fourth last.
The tendencies of stress assignment are as follows:
In words ending in vowels and /s/, stress most often falls on the penultimate syllable.
In words ending in all other consonants, the stress more often falls on the ultimate syllable.
Preantepenultimate stress occurs rarely and only in words like guardándoselos ('saving them for him/her') where a clitic follows certain verbal forms.
In addition to the many exceptions to these tendencies, there are numerous minimal pairs which contrast solely on stress.
For example, sabana, with penultimate stress, means 'savannah' while sábana, with antepenultimate stress, means 'sheet'; límite ('boundary'), limite ('[that] he/she limits') and limité ('I limited') also contrast solely on stress.
Phonological stress may be marked orthographically with an acute accent (ácido, distinción, etc).
This is done according to the mandatory stress rules of Spanish orthography which are similar to the tendencies above (differing with words like distinción) and are defined so as to unequivocally indicate where the stress lies in a given written word.
An acute accent may also be used to differentiate homophones (such as té for 'tea' and te for the object pronoun 'you').
An amusing example of the significance of intonation in Spanish is the phrase
("What do you mean / 'how / do I eat'? / I eat / the way / I eat!").
Grammar
Spanish is a relatively inflected language, with a two-gender system and about fifty conjugated forms per verb, but limited inflection of nouns, adjectives, and determiners.
(For a detailed overview of verbs, see Spanish verbs and Spanish irregular verbs.)
It is right-branching, uses prepositions, and usually, though not always, places adjectives after nouns.
Its syntax is generally Subject Verb Object, though variations are common.
It is a pro-drop language (allows the deletion of pronouns when pragmatically unnecessary) and verb-framed.
Samples
Speech recognition
Speech recognition (also known as automatic speech recognition or computer speech recognition) converts spoken words to machine-readable input (for example, to keypresses, using the binary code for a string of character codes).
The term voice recognition may also be used to refer to speech recognition, but more precisely refers to speaker recognition, which attempts to identify the person speaking, as opposed to what is being said.
Speech recognition applications include voice dialing (e.g., "Call home"), call routing (e.g., "I would like to make a collect call"), domotic appliance control and content-based spoken audio search (e.g., find a podcast where particular words were spoken), simple data entry (e.g., entering a credit card number), preparation of structured documents (e.g., a radiology report), speech-to-text processing (e.g., word processors or emails), and in aircraft cockpits (usually termed Direct Voice Input).
History
One of the most notable domains for the commercial application of speech recognition in the United States has been health care and in particular the work of the medical transcriptionist (MT).
According to industry experts, at its inception, speech recognition (SR) was sold as a way to completely eliminate transcription rather than make the transcription process more efficient, hence it was not accepted.
It was also the case that SR at that time was often technically deficient.
Additionally, to be used effectively, it required changes to the ways physicians worked and documented clinical encounters, which many if not all were reluctant to do.
The biggest limitation to speech recognition automating transcription, however, is seen as the software.
The nature of narrative dictation is highly interpretive and often requires judgment that may be provided by a real human but not yet by an automated system.
Another limitation has been the extensive amount of time required by the user and/or system provider to train the software.
A distinction in ASR is often made between "artificial syntax systems" which are usually domain-specific and "natural language processing" which is usually language-specific.
Each of these types of application presents its own particular goals and challenges.
Applications
Health care
In the health care domain, even in the wake of improving speech recognition technologies, medical transcriptionists (MTs) have not yet become obsolete.
Many experts in the field anticipate that with increased use of speech recognition technology, the services provided may be redistributed rather than replaced.
Speech recognition can be implemented at the front end or the back end of the medical documentation process.
Front-End SR is where the provider dictates into a speech-recognition engine, the recognized words are displayed right after they are spoken, and the dictator is responsible for editing and signing off on the document.
It never goes through an MT/editor.
Back-End SR or Deferred SR is where the provider dictates into a digital dictation system, and the voice is routed through a speech-recognition machine and the recognized draft document is routed along with the original voice file to the MT/editor, who edits the draft and finalizes the report.
Deferred SR is being widely used in the industry currently.
Many Electronic Medical Records (EMR) applications can be more effective and may be performed more easily when deployed in conjunction with a speech-recognition engine.
Searches, queries, and form filling may all be faster to perform by voice than by using a keyboard.
Military
High-performance fighter aircraft
Substantial efforts have been devoted in the last decade to the test and evaluation of speech recognition in fighter aircraft.
Of particular note are the U.S. program in speech recognition for the Advanced Fighter Technology Integration (AFTI)/F-16 aircraft (F-16 VISTA), the program in France on installing speech recognition systems on Mirage aircraft, and programs in the UK dealing with a variety of aircraft platforms.
In these programs, speech recognizers have been operated successfully in fighter aircraft with applications including: setting radio frequencies, commanding an autopilot system, setting steer-point coordinates and weapons release parameters, and controlling flight displays.
Generally, only very limited, constrained vocabularies have been used successfully, and a major effort has been devoted to integration of the speech recognizer with the avionics system.
Some important conclusions from the work were as follows:
Speech recognition has definite potential for reducing pilot workload, but this potential was not realized consistently.
Achievement of very high recognition accuracy (95% or more) was the most critical factor for making the speech recognition system useful — with lower recognition rates, pilots would not use the system.
More natural vocabulary and grammar, and shorter training times would be useful, but only if very high recognition rates could be maintained.
Laboratory research in robust speech recognition for military environments has produced promising results which, if extendable to the cockpit, should improve the utility of speech recognition in high-performance aircraft.
Working with Swedish pilots flying in the JAS-39 Gripen cockpit, Englund (2004) found recognition deteriorated with increasing G-loads.
It was also concluded that adaptation greatly improved the results in all cases and introducing models for breathing was shown to improve recognition scores significantly.
Contrary to what might be expected, no effects of the broken English of the speakers were found.
It was evident that spontaneous speech caused problems for the recognizer, as could be expected.
A restricted vocabulary, and above all, a proper syntax, could thus be expected to improve recognition accuracy substantially.
The Eurofighter Typhoon currently in service with the UK RAF employs a speaker-dependent system, i.e. it requires each pilot to create a template.
The system is not used for any safety critical or weapon critical tasks, such as weapon release or lowering of the undercarriage, but is used for a wide range of other cockpit functions.
Voice commands are confirmed by visual and/or aural feedback.
The system is seen as a major design feature in the reduction of pilot workload, and even allows the pilot to assign targets to himself with two simple voice commands or to any of his wingmen with only five commands.
Helicopters
The problems of achieving high recognition accuracy under stress and noise pertain strongly to the helicopter environment as well as to the fighter environment.
The acoustic noise problem is actually more severe in the helicopter environment, not only because of the high noise levels but also because the helicopter pilot generally does not wear a facemask, which would reduce acoustic noise in the microphone.
Substantial test and evaluation programs have been carried out in the past decade in speech recognition systems applications in helicopters, notably by the U.S. Army Avionics Research and Development Activity (AVRADA) and by the Royal Aerospace Establishment (RAE) in the UK.
Work in France has included speech recognition in the Puma helicopter.
There has also been much useful work in Canada.
Results have been encouraging, and voice applications have included: control of communication radios; setting of navigation systems; and control of an automated target handover system.
As in fighter applications, the overriding issue for voice in helicopters is the impact on pilot effectiveness.
Encouraging results are reported for the AVRADA tests, although these represent only a feasibility demonstration in a test environment.
Much remains to be done, both in speech recognition and in overall speech technology, in order to consistently achieve performance improvements in operational settings.
Battle management
Battle management command centres generally require rapid access to and control of large, rapidly changing information databases.
Commanders and system operators need to query these databases as conveniently as possible, in an eyes-busy environment where much of the information is presented in a display format.
Human machine interaction by voice has the potential to be very useful in these environments.
A number of efforts have been undertaken to interface commercially available isolated-word recognizers into battle management environments.
In one feasibility study, speech recognition equipment was tested in conjunction with an integrated information display for naval battle management applications.
Users were very optimistic about the potential of the system, although capabilities were limited.
Speech understanding programs sponsored by the Defense Advanced Research Projects Agency (DARPA) in the U.S. have focused on this problem of natural speech interface.
Continuous speech recognition (CSR) efforts have focused on a database of large-vocabulary continuous speech which is designed to be representative of the naval resource management task.
Significant advances in the state-of-the-art in CSR have been achieved, and current efforts are focused on integrating speech recognition and natural language processing to allow spoken language interaction with a naval resource management system.
Training air traffic controllers
Training for military (or civilian) air traffic controllers (ATC) represents an excellent application for speech recognition systems.
Many ATC training systems currently require a person to act as a "pseudo-pilot", engaging in a voice dialog with the trainee controller, which simulates the dialog which the controller would have to conduct with pilots in a real ATC situation.
Speech recognition and synthesis techniques offer the potential to eliminate the need for a person to act as pseudo-pilot, thus reducing training and support personnel.
Air controller tasks are also characterized by highly structured speech as the primary output of the controller, hence reducing the difficulty of the speech recognition task.
The U.S. Naval Training Equipment Center has sponsored a number of developments of prototype ATC trainers using speech recognition.
Generally, the recognition accuracy falls short of providing graceful interaction between the trainee and the system.
However, the prototype training systems have demonstrated a significant potential for voice interaction in these systems, and in other training applications.
The U.S. Navy has sponsored a large-scale effort in ATC training systems, where a commercial speech recognition unit was integrated with a complex training system including displays and scenario creation.
Although the recognizer was constrained in vocabulary, one of the goals of the training programs was to teach the controllers to speak in a constrained language, using specific vocabulary specifically designed for the ATC task.
Research in France has focussed on the application of speech recognition in ATC training systems, directed at issues both in speech recognition and in application of task-domain grammar constraints.
The USAF, USMC, US Army, and FAA are currently using ATC simulators with speech recognition provided by Adacel Systems Inc (ASI).
Adacel's MaxSim software uses speech recognition and synthetic speech to enable the trainee to control aircraft and ground vehicles in the simulation without the need for pseudo pilots.
Adacel's ATC In A Box software provides a synthetic ATC environment for flight simulators.
The "real" pilot talks to a virtual controller using speech recognition and the virtual controller responds with synthetic speech.
Telephony and other domains
ASR in the field of telephony is now commonplace and in the field of computer gaming and simulation is becoming more widespread.
Despite the high level of integration with word processing in general personal computing, however, ASR in the field of document production has not seen the expected increases in use.
Improvements in mobile processor speeds have made speech-enabled Symbian and Windows Mobile smartphones possible.
Current speech-to-text programs are too large and require too much CPU power to be practical for the Pocket PC.
Speech is used mostly as a part of User Interface, for creating pre-defined or custom speech commands.
Leading software vendors in this field are: Microsoft Corporation (Microsoft Voice Command); Nuance Communications (Nuance Voice Control); Vito Technology (VITO Voice2Go); Speereo Software (Speereo Voice Translator).
People with Disabilities
People with disabilities are another part of the population that benefits from using speech recognition programs.
Speech recognition is especially useful for people who have difficulty using their hands or are unable to use them, from those with mild repetitive stress injuries to those with involved disabilities who require alternative input to access the computer.
In fact, people who used the keyboard a lot and developed RSI became an urgent early market for speech recognition.
Speech recognition is used in deaf telephony, such as SpinVox voice-to-text voicemail, relay services, and captioned telephone.
Further applications
Automatic translation
Automotive speech recognition (e.g., Ford Sync)
Telematics (e.g. vehicle navigation systems)
Court reporting (Realtime Voice Writing)
Hands-free computing: voice command recognition computer user interface
Home automation
Interactive voice response
Mobile telephony, including mobile email
Multimodal interaction
Pronunciation evaluation in computer-aided language learning applications
Robotics
Transcription (digital speech-to-text)
Speech-to-Text (Transcription of speech into mobile text messages)
Performance of speech recognition systems
The performance of speech recognition systems is usually specified in terms of accuracy and speed.
Accuracy is usually rated with word error rate (WER), whereas speed is measured with the real time factor.
Other measures of accuracy include Single Word Error Rate (SWER) and Command Success Rate (CSR).
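The two headline measures can be sketched in a few lines of Python. This is a minimal illustration, assuming the common definitions: WER as the word-level edit distance (substitutions, deletions and insertions) divided by the number of reference words, and the real time factor as processing time divided by audio duration. The function names are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Values below 1.0 mean the recognizer runs faster than real time."""
    return processing_seconds / audio_seconds
```

Because insertions are counted, WER computed this way can exceed 1.0 when the hypothesis is much longer than the reference.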
Most speech recognition users would tend to agree that dictation machines can achieve very high performance in controlled conditions.
There is some confusion, however, over the interchangeability of the terms "speech recognition" and "dictation".
Commercially available speaker-dependent dictation systems usually require only a short period of training (sometimes also called "enrollment") and may successfully capture continuous speech with a large vocabulary at a normal pace with very high accuracy.
Most commercial companies claim that recognition software can achieve between 98% and 99% accuracy if operated under optimal conditions.
"Optimal conditions" usually assume that users:
have speech characteristics which match the training data,
can achieve proper speaker adaptation, and
work in a low-noise environment (e.g. quiet office or laboratory space).
This explains why some users, especially those whose speech is heavily accented, might achieve recognition rates much lower than expected.
Speech recognition in video has become a popular search technology used by several video search companies.
Limited vocabulary systems, requiring no training, can recognize a small number of words (for instance, the ten digits) as spoken by most speakers.
Such systems are popular for routing incoming phone calls to their destinations in large organizations.
Both acoustic modeling and language modeling are important parts of modern statistically-based speech recognition algorithms.
Hidden Markov models (HMMs) are widely used in many systems.
Language modeling has many other applications such as smart keyboard and document classification.
Hidden Markov model (HMM)-based speech recognition
Modern general-purpose speech recognition systems are generally based on HMMs.
These are statistical models which output a sequence of symbols or quantities.
One possible reason why HMMs are used in speech recognition is that a speech signal could be viewed as a piecewise stationary signal or a short-time stationary signal.
That is, over a short time window in the range of 10 milliseconds, speech can be approximated as a stationary process.
Speech could thus be thought of as a Markov model for many stochastic processes.
Another reason why HMMs are popular is that they can be trained automatically and are simple and computationally feasible to use.
In speech recognition, the hidden Markov model would output a sequence of n-dimensional real-valued vectors (with n being a small integer, such as 10), outputting one of these every 10 milliseconds.
The vectors would consist of cepstral coefficients, which are obtained by taking a Fourier transform of a short time window of speech and decorrelating the spectrum using a cosine transform, then taking the first (most significant) coefficients.
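The feature computation described above can be sketched as follows. This is a simplified illustration rather than a production front end: it uses a naive discrete Fourier transform, takes the logarithm of the magnitude spectrum before the cosine transform (a standard step that the description above leaves implicit, assumed here), and omits windowing and any perceptual frequency warping. The function name and the default of 10 coefficients are illustrative.

```python
import cmath
import math


def cepstral_coefficients(frame, num_coeffs=10):
    """Return the first num_coeffs cepstral coefficients of one short frame."""
    n = len(frame)
    # Magnitude spectrum via a naive DFT, keeping the non-redundant bins.
    magnitudes = []
    for k in range(n // 2 + 1):
        total = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame))
        magnitudes.append(abs(total))
    # Log compression (assumed step); the small constant avoids log(0).
    log_spectrum = [math.log(m + 1e-10) for m in magnitudes]
    # Cosine transform (DCT-II) decorrelates the log spectrum;
    # only the first coefficients are kept.
    size = len(log_spectrum)
    return [
        sum(value * math.cos(math.pi * q * (i + 0.5) / size)
            for i, value in enumerate(log_spectrum))
        for q in range(num_coeffs)
    ]


# Example: one 10 ms frame of a 440 Hz tone sampled at 8 kHz (80 samples).
frame = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(80)]
features = cepstral_coefficients(frame)
```

A real system would compute one such vector for every frame, giving the sequence of observation vectors that the HMM is assumed to emit.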
The hidden Markov model will tend to have in each state a statistical distribution that is a mixture of diagonal covariance Gaussians which will give a likelihood for each observed vector.
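As an illustration of how such a state distribution scores an observation, the sketch below evaluates the log-likelihood of one vector under a mixture of diagonal-covariance Gaussians. The function and parameter names are hypothetical; the log-sum-exp step is a common numerical safeguard rather than something the description above prescribes.

```python
import math


def log_gaussian_diag(x, mean, variance):
    """Log density of x under a Gaussian with a diagonal covariance matrix."""
    return -0.5 * sum(
        math.log(2 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, variance)
    )


def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of x under a mixture of diagonal-covariance Gaussians."""
    component_logs = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    # Log-sum-exp: subtract the largest term before exponentiating.
    peak = max(component_logs)
    return peak + math.log(sum(math.exp(c - peak) for c in component_logs))
```

With a diagonal covariance, each dimension contributes an independent term, which is why the cost grows only linearly with the dimension of the feature vector.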
Each word, or (for more general speech recognition systems), each phoneme, will have a different output distribution; a hidden Markov model for a sequence of words or phonemes is made by concatenating the individual trained hidden Markov models for the separate words and phonemes.
Described above are the core elements of the most common, HMM-based approach to speech recognition.
Modern speech recognition systems use various combinations of a number of standard techniques in order to improve results over the basic approach described above.
A typical large-vocabulary system would need context dependency for the phonemes (so phonemes with different left and right context have different realizations as HMM states); it would use cepstral normalization to normalize for different speaker and recording conditions; for further speaker normalization it might use vocal tract length normalization (VTLN) for male-female normalization and maximum likelihood linear regression (MLLR) for more general speaker adaptation.
The features would have so-called delta and delta-delta coefficients to capture speech dynamics and in addition might use heteroscedastic linear discriminant analysis (HLDA); or might skip the delta and delta-delta coefficients and use splicing and an LDA-based projection followed perhaps by heteroscedastic linear discriminant analysis or a global semitied covariance transform (also known as maximum likelihood linear transform, or MLLT).
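Delta and delta-delta coefficients can be illustrated with a simple sketch. It assumes the simplest possible estimate, a central difference between the neighbouring frames with the edge frames repeated; practical systems often use a regression over a wider window instead. The function names are hypothetical.

```python
def deltas(frames):
    """Central-difference time derivative of a sequence of feature vectors."""
    last = len(frames) - 1
    result = []
    for t in range(len(frames)):
        nxt = frames[min(t + 1, last)]  # repeat the last frame at the end
        prv = frames[max(t - 1, 0)]     # repeat the first frame at the start
        result.append([(a - b) / 2 for a, b in zip(nxt, prv)])
    return result


def add_dynamic_features(frames):
    """Append delta and delta-delta coefficients to each static vector."""
    first = deltas(frames)
    second = deltas(first)
    return [s + d + dd for s, d, dd in zip(frames, first, second)]
```

Each output vector is three times the length of the static vector: the static coefficients, their first differences, and their second differences.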
Many systems use so-called discriminative training techniques which dispense with a purely statistical approach to HMM parameter estimation and instead optimize some classification-related measure of the training data.
Examples are maximum mutual information (MMI), minimum classification error (MCE) and minimum phone error (MPE).
Decoding of the speech (the term for what happens when the system is presented with a new utterance and must compute the most likely source sentence) would probably use the Viterbi algorithm to find the best path, and here there is a choice between dynamically creating a combination hidden Markov model which includes both the acoustic and language model information, or combining it statically beforehand (the finite state transducer, or FST, approach).
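The best-path search can be sketched as below. This is a minimal Viterbi implementation over a small HMM in the log domain, assuming the per-frame, per-state observation log-likelihoods have already been computed; the two-state model and its probabilities are invented purely for illustration, and a real decoder would add pruning and a language model.

```python
import math

NEG_INF = float("-inf")


def viterbi(obs_loglik, log_trans, log_init):
    """Return the most likely state sequence and its log probability.

    obs_loglik[t][s] is the log-likelihood of frame t in state s,
    log_trans[p][s] the log probability of moving from state p to s.
    """
    num_states = len(log_init)
    score = [log_init[s] + obs_loglik[0][s] for s in range(num_states)]
    backpointers = []
    for t in range(1, len(obs_loglik)):
        new_score, pointers = [], []
        for s in range(num_states):
            best_prev = max(range(num_states),
                            key=lambda p: score[p] + log_trans[p][s])
            new_score.append(score[best_prev] + log_trans[best_prev][s]
                             + obs_loglik[t][s])
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(range(num_states), key=lambda s: score[s])
    best_score = score[state]
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path, best_score


# Hypothetical left-to-right model with two states and three frames.
log_init = [0.0, NEG_INF]
log_trans = [[math.log(0.5), math.log(0.5)], [NEG_INF, 0.0]]
obs_loglik = [[math.log(0.9), math.log(0.1)],
              [math.log(0.8), math.log(0.2)],
              [math.log(0.1), math.log(0.9)]]
path, log_prob = viterbi(obs_loglik, log_trans, log_init)
```

In this example the best path stays in the first state for two frames and then moves to the second.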
Dynamic time warping (DTW)-based speech recognition
Dynamic time warping is an approach that was historically used for speech recognition but has now largely been displaced by the more successful HMM-based approach.
Dynamic time warping is an algorithm for measuring similarity between two sequences which may vary in time or speed.
For instance, similarities in walking patterns would be detected, even if in one video the person was walking slowly and if in another they were walking more quickly, or even if there were accelerations and decelerations during the course of one observation.
DTW has been applied to video, audio, and graphics – indeed, any data which can be turned into a linear representation can be analyzed with DTW.
A well known application has been automatic speech recognition, to cope with different speaking speeds.
In general, it is a method that allows a computer to find an optimal match between two given sequences (e.g. time series) with certain restrictions, i.e. the sequences are "warped" non-linearly to match each other.
This sequence alignment method is often used in the context of hidden Markov models.
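A minimal sketch of dynamic time warping on two one-dimensional sequences follows. It assumes the textbook formulation: absolute difference as the local cost, and steps that advance either sequence or both. For speech, the sequence elements would be feature vectors and the local cost a vector distance. The function name is hypothetical.

```python
def dtw_distance(a, b):
    """Cost of the best non-linear alignment between sequences a and b."""
    inf = float("inf")
    # cost[i][j] is the best alignment cost of a[:i] against b[:j].
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats its element
                cost[i][j - 1],      # b advances, a repeats its element
                cost[i - 1][j - 1],  # both advance
            )
    return cost[len(a)][len(b)]
```

A sequence and a slowed-down copy of it, such as `[1, 2, 3]` and `[1, 1, 2, 2, 3, 3]`, align with zero cost, which is the property that made the method useful for coping with different speaking speeds.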
Further information
Popular speech recognition conferences held each year or two include ICASSP, Eurospeech/ICSLP (now named Interspeech) and the IEEE ASRU.
Conferences in the field of Natural Language Processing, such as ACL, NAACL, EMNLP, and HLT, are beginning to include papers on speech processing.
Important journals include the IEEE Transactions on Speech and Audio Processing (now named IEEE Transactions on Audio, Speech and Language Processing), Computer Speech and Language, and Speech Communication.
Books like "Fundamentals of Speech Recognition" by Lawrence Rabiner can be useful to acquire basic knowledge but may not be fully up to date (1993).
Another good source can be "Statistical Methods for Speech Recognition" by Frederick Jelinek which is a more up to date book (1998).
Even more up to date is "Computer Speech", by Manfred R. Schroeder, second edition published in 2004.
A good insight into the techniques used in the best modern systems can be gained by paying attention to government sponsored evaluations such as those organised by DARPA (the largest speech recognition-related project ongoing as of 2007 is the GALE project, which involves both speech recognition and translation components).
In terms of freely available resources, the HTK book (and the accompanying HTK toolkit) is one place to start to both learn about speech recognition and to start experimenting.
Another such resource is Carnegie Mellon University's SPHINX toolkit.
The AT&T libraries FSM Library, GRM library, and DCD library are also general software libraries for large-vocabulary speech recognition.
A useful review of the area of robustness in ASR is provided by Junqua and Haton (1995).
Speech synthesis
Speech synthesis is the artificial production of human speech.
A computer system used for this purpose is called a speech synthesizer, and can be implemented in software or hardware.
A text-to-speech (TTS) system converts normal language text into speech; other systems render symbolic linguistic representations like phonetic transcriptions into speech.
Synthesized speech can be created by concatenating pieces of recorded speech that are stored in a database.
Systems differ in the size of the stored speech units; a system that stores phones or diphones provides the largest output range, but may lack clarity.
For specific usage domains, the storage of entire words or sentences allows for high-quality output.
Alternatively, a synthesizer can incorporate a model of the vocal tract and other human voice characteristics to create a completely "synthetic" voice output.
The quality of a speech synthesizer is judged by its similarity to the human voice, and by its ability to be understood.
An intelligible text-to-speech program allows people with visual impairments or reading disabilities to listen to written works on a home computer.
Many computer operating systems have included speech synthesizers since the early 1980s.
Overview of text processing
A text-to-speech system (or "engine") is composed of two parts: a front-end and a back-end.
The front-end has two major tasks.
First, it converts raw text containing symbols like numbers and abbreviations into the equivalent of written-out words.
This process is often called text normalization, pre-processing, or tokenization.
The front-end then assigns phonetic transcriptions to each word, and divides and marks the text into prosodic units, like phrases, clauses, and sentences.
The process of assigning phonetic transcriptions to words is called text-to-phoneme or grapheme-to-phoneme conversion.
Phonetic transcriptions and prosody information together make up the symbolic linguistic representation that is output by the front-end.
The back-end—often referred to as the synthesizer—then converts the symbolic linguistic representation into sound.
History
Long before electronic signal processing was invented, there were those who tried to build machines to create human speech.
Some early legends of the existence of "speaking heads" involved Gerbert of Aurillac (d. 1003 AD), Albertus Magnus (1198–1280), and Roger Bacon (1214–1294).
In 1779, the Danish scientist Christian Kratzenstein, working at the Russian Academy of Sciences, built models of the human vocal tract that could produce the five long vowel sounds (in International Phonetic Alphabet notation, they are [aː], [eː], [iː], [oː] and [uː]).
This was followed by the bellows-operated "acoustic-mechanical speech machine" by Wolfgang von Kempelen of Vienna, Austria, described in a 1791 paper.
This machine added models of the tongue and lips, enabling it to produce consonants as well as vowels.
In 1837, Charles Wheatstone produced a "speaking machine" based on von Kempelen's design, and in 1857, M. Faber built the "Euphonia".
Wheatstone's design was resurrected in 1923 by Paget.
In the 1930s, Bell Labs developed the VOCODER, a keyboard-operated electronic speech analyzer and synthesizer that was said to be clearly intelligible.
Homer Dudley refined this device into the VODER, which he exhibited at the 1939 New York World's Fair.
The Pattern playback was built by Dr. Franklin S. Cooper and his colleagues at Haskins Laboratories in the late 1940s and completed in 1950.
There were several different versions of this hardware device but only one currently survives.
The machine converts pictures of the acoustic patterns of speech in the form of a spectrogram back into sound.
Using this device, Alvin Liberman and colleagues were able to discover acoustic cues for the perception of phonetic segments (consonants and vowels).
Early electronic speech synthesizers sounded robotic and were often barely intelligible.
However, the quality of synthesized speech has steadily improved, and output from contemporary speech synthesis systems is sometimes indistinguishable from actual human speech.
Electronic devices
The first computer-based speech synthesis systems were created in the late 1950s, and the first complete text-to-speech system was completed in 1968.
In 1961, physicist John Larry Kelly, Jr and colleague Louis Gerstman used an IBM 704 computer to synthesize speech, an event among the most prominent in the history of Bell Labs.
Kelly's voice recorder synthesizer (vocoder) recreated the song "Daisy Bell", with musical accompaniment from Max Mathews.
Coincidentally, Arthur C. Clarke was visiting his friend and colleague John Pierce at the Bell Labs Murray Hill facility.
Clarke was so impressed by the demonstration that he used it in the climactic scene of his screenplay for his novel 2001: A Space Odyssey, where the HAL 9000 computer sings the same song as it is being put to sleep by astronaut Dave Bowman.
Despite the success of purely electronic speech synthesis, research is still being conducted into mechanical speech synthesizers.
Synthesizer technologies
The most important qualities of a speech synthesis system are naturalness and intelligibility.
Naturalness describes how closely the output sounds like human speech, while intelligibility is the ease with which the output is understood.
The ideal speech synthesizer is both natural and intelligible.
Speech synthesis systems usually try to maximize both characteristics.
The two primary technologies for generating synthetic speech waveforms are concatenative synthesis and formant synthesis.
Each technology has strengths and weaknesses, and the intended uses of a synthesis system will typically determine which approach is used.
Concatenative synthesis
Concatenative synthesis is based on the concatenation (or stringing together) of segments of recorded speech.
Generally, concatenative synthesis produces the most natural-sounding synthesized speech.
However, differences between natural variations in speech and the nature of the automated techniques for segmenting the waveforms sometimes result in audible glitches in the output.
There are three main sub-types of concatenative synthesis.
Unit selection synthesis
Unit selection synthesis uses large databases of recorded speech.
During database creation, each recorded utterance is segmented into some or all of the following: individual phones, diphones, half-phones, syllables, morphemes, words, phrases, and sentences.
Typically, the division into segments is done using a specially modified speech recognizer set to a "forced alignment" mode with some manual correction afterward, using visual representations such as the waveform and spectrogram.
An index of the units in the speech database is then created based on the segmentation and acoustic parameters like the fundamental frequency (pitch), duration, position in the syllable, and neighboring phones.
At runtime, the desired target utterance is created by determining the best chain of candidate units from the database (unit selection).
This process is typically achieved using a specially weighted decision tree.
Unit selection provides the greatest naturalness, because it applies only a small amount of digital signal processing (DSP) to the recorded speech.
DSP often makes recorded speech sound less natural, although some systems use a small amount of signal processing at the point of concatenation to smooth the waveform.
The output from the best unit-selection systems is often indistinguishable from real human voices, especially in contexts for which the TTS system has been tuned.
However, maximum naturalness typically requires unit-selection speech databases to be very large, in some systems ranging into the gigabytes of recorded data, representing dozens of hours of speech.
Also, unit selection algorithms have been known to select segments from a place that results in less than ideal synthesis (e.g. minor words become unclear) even when a better choice exists in the database.
Diphone synthesis
Diphone synthesis uses a minimal speech database containing all the diphones (sound-to-sound transitions) occurring in a language.
The number of diphones depends on the phonotactics of the language: for example, Spanish has about 800 diphones, and German about 2500.
In diphone synthesis, only one example of each diphone is contained in the speech database.
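To illustrate what the database has to cover, the sketch below lists the diphones needed for a phone sequence. It assumes that the utterance is padded with a silence symbol at both ends so that the transitions into the first phone and out of the last one are included; the symbol `_` and the function name are hypothetical.

```python
def to_diphones(phones, silence="_"):
    """List the phone-to-phone transitions needed to synthesize phones."""
    padded = [silence] + list(phones) + [silence]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


# A three-phone word needs four diphones, including the silence transitions.
units = to_diphones(["k", "a", "t"])
```

At synthesis time each listed unit would be looked up in the database and the recordings joined in order.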
At runtime, the target prosody of a sentence is superimposed on these minimal units by means of digital signal processing techniques such as linear predictive coding, PSOLA or MBROLA.
The quality of the resulting speech is generally worse than that of unit-selection systems, but more natural-sounding than the output of formant synthesizers.
Diphone synthesis suffers from the sonic glitches of concatenative synthesis and the robotic-sounding nature of formant synthesis, and has few of the advantages of either approach other than small size.
As such, its use in commercial applications is declining, although it continues to be used in research because there are a number of freely available software implementations.
Domain-specific synthesis
Domain-specific synthesis concatenates prerecorded words and phrases to create complete utterances.
It is used in applications where the variety of texts the system will output is limited to a particular domain, like transit schedule announcements or weather reports.
The technology is very simple to implement, and has been in commercial use for a long time, in devices like talking clocks and calculators.
The level of naturalness of these systems can be very high because the variety of sentence types is limited, and they closely match the prosody and intonation of the original recordings.
Because these systems are limited by the words and phrases in their databases, they are not general-purpose and can only synthesize the combinations of words and phrases with which they have been preprogrammed.
The blending of words within naturally spoken language, however, can still cause problems unless the many variations are taken into account.
For example, in non-rhotic dialects of English the "r" in words like "clear" is usually only pronounced when the following word begins with a vowel (e.g. in "clear out" the "r" is sounded).
Likewise in French, many final consonants that are otherwise silent are pronounced when followed by a word that begins with a vowel, an effect called liaison.
This alternation cannot be reproduced by a simple word-concatenation system, which would require additional complexity to be context-sensitive.
Formant synthesis
Formant synthesis does not use human speech samples at runtime.
Instead, the synthesized speech output is created using an acoustic model.
Parameters such as fundamental frequency, voicing, and noise levels are varied over time to create a waveform of artificial speech.
This method is sometimes called rules-based synthesis; however, many concatenative systems also have rules-based components.
Many systems based on formant synthesis technology generate artificial, robotic-sounding speech that would never be mistaken for human speech.
However, maximum naturalness is not always the goal of a speech synthesis system, and formant synthesis systems have advantages over concatenative systems.
Formant-synthesized speech can be reliably intelligible, even at very high speeds, avoiding the acoustic glitches that commonly plague concatenative systems.
High-speed synthesized speech is used by the visually impaired to quickly navigate computers using a screen reader.
Formant synthesizers are usually smaller programs than concatenative systems because they do not have a database of speech samples.
They can therefore be used in embedded systems, where memory and microprocessor power are especially limited.
Because formant-based systems have complete control of all aspects of the output speech, a wide variety of prosodies and intonations can be output, conveying not just questions and statements, but a variety of emotions and tones of voice.
Examples of non-real-time but highly accurate intonation control in formant synthesis include the work done in the late 1970s for the Texas Instruments toy Speak & Spell, and in the early 1980s Sega arcade machines.
Creating proper intonation for these projects was painstaking, and the results have yet to be matched by real-time text-to-speech interfaces.
Articulatory synthesis
Articulatory synthesis refers to computational techniques for synthesizing speech based on models of the human vocal tract and the articulation processes occurring there.
The first articulatory synthesizer regularly used for laboratory experiments was developed at Haskins Laboratories in the mid-1970s by Philip Rubin, Tom Baer, and Paul Mermelstein.
This synthesizer, known as ASY, was based on vocal tract models developed at Bell Laboratories in the 1960s and 1970s by Paul Mermelstein, Cecil Coker, and colleagues.
Until recently, articulatory synthesis models have not been incorporated into commercial speech synthesis systems.
A notable exception is the NeXT-based system originally developed and marketed by Trillium Sound Research, a spin-off company of the University of Calgary, where much of the original research was conducted.
Following the demise of the various incarnations of NeXT (started by Steve Jobs in the late 1980s and merged with Apple Computer in 1997), the Trillium software was published under the GNU General Public License, with work continuing as gnuspeech.
The system, first marketed in 1994, provides full articulatory-based text-to-speech conversion using a waveguide or transmission-line analog of the human oral and nasal tracts controlled by Carré's "distinctive region model".
HMM-based synthesis
HMM-based synthesis is a synthesis method based on hidden Markov models.
In this system, the frequency spectrum (vocal tract), fundamental frequency (vocal source), and duration (prosody) of speech are modeled simultaneously by HMMs.
Speech waveforms are generated from HMMs themselves based on the maximum likelihood criterion.
Sinewave synthesis
Sinewave synthesis is a technique for synthesizing speech by replacing the formants (main bands of energy) with pure tone whistles.
Challenges
Text normalization challenges
The process of normalizing text is rarely straightforward.
Texts are full of heteronyms, numbers, and abbreviations that all require expansion into a phonetic representation.
There are many spellings in English which are pronounced differently based on context.
For example, "My latest project is to learn how to better project my voice" contains two pronunciations of "project".
Most text-to-speech (TTS) systems do not generate semantic representations of their input texts, as processes for doing so are not reliable, well understood, or computationally effective.
As a result, various heuristic techniques are used to guess the proper way to disambiguate homographs, like examining neighboring words and using statistics about frequency of occurrence.
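A toy version of such a heuristic is sketched below for the homograph "project". The cue-word lists, the default reading and the stress notation are all invented for illustration; a real front end would rely on part-of-speech tagging or statistics gathered from a corpus.

```python
# Hypothetical cue words: what precedes a noun reading or a verb reading.
NOUN_CUES = {"a", "an", "the", "my", "your", "this", "latest", "new"}
VERB_CUES = {"to", "will", "can", "must", "better"}


def pronounce_project(words, index):
    """Guess the stress pattern of 'project' from the preceding word."""
    previous = words[index - 1].lower() if index > 0 else ""
    if previous in VERB_CUES:
        return "pro-JECT"   # verb reading
    if previous in NOUN_CUES:
        return "PROJ-ect"   # noun reading
    return "PROJ-ect"       # assumed default: the more frequent reading


sentence = "My latest project is to learn how to better project my voice".split()
readings = [pronounce_project(sentence, i)
            for i, word in enumerate(sentence) if word == "project"]
```

On the example sentence the first occurrence is read as a noun and the second as a verb.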
Deciding how to convert numbers is another problem that TTS systems have to address.
It is a simple programming challenge to convert a number into words, like "1325" becoming "one thousand three hundred twenty-five."
However, numbers occur in many different contexts; when a year or part of an address, "1325" should likely be read as "thirteen twenty-five", or, when part of a social security number, as "one three two five".
A TTS system can often infer how to expand a number based on surrounding words, numbers, and punctuation, and sometimes the system provides a way to specify the context if it is ambiguous.
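The three readings of "1325" mentioned above can be sketched as follows. The code assumes that the context has already been determined and is passed in as a label; the labels, the function names and the restriction to numbers below 10,000 are simplifications made for illustration.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def two_digits(n):
    """Words for 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")


def cardinal(n):
    """Words for 0..9999 as a quantity."""
    thousands, rest = divmod(n, 1000)
    hundreds, rest = divmod(rest, 100)
    parts = []
    if thousands:
        parts.append(ONES[thousands] + " thousand")
    if hundreds:
        parts.append(ONES[hundreds] + " hundred")
    if rest or not parts:
        parts.append(two_digits(rest))
    return " ".join(parts)


def expand_number(text, context="cardinal"):
    """Expand a digit string according to a context label (hypothetical)."""
    if context == "digits":   # e.g. part of an identification number
        return " ".join(ONES[int(d)] for d in text)
    n = int(text)
    if context == "pairs":    # e.g. a year or a street address
        high, low = divmod(n, 100)
        if 10 <= high <= 99 and low >= 10:
            return two_digits(high) + " " + two_digits(low)
        # Simplification: other cases fall back to the cardinal reading.
    return cardinal(n)
```

Choosing the context label is the hard part in practice; the expansion itself is mechanical once the label is known.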
Similarly, abbreviations can be ambiguous.
For example, the abbreviation "in" for "inches" must be differentiated from the word "in", and the address "12 St John St." uses the same abbreviation for both "Saint" and "Street".
TTS systems with intelligent front ends can make educated guesses about ambiguous abbreviations, while others provide the same result in all cases, resulting in nonsensical (and sometimes comical) outputs.
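One possible educated guess for the "St" example is sketched below. The rule, that "St" directly followed by a capitalized word is read as "Saint" and otherwise as "Street", is a hypothetical heuristic chosen for illustration and will fail on many real addresses.

```python
def expand_st(text):
    """Expand the abbreviation 'St' using the following word as a cue."""
    tokens = text.split()
    expanded = []
    for i, token in enumerate(tokens):
        if token.rstrip(".") == "St":
            following = tokens[i + 1] if i + 1 < len(tokens) else ""
            # Hypothetical rule: a capitalized name after "St" suggests "Saint".
            if following[:1].isupper() and following.rstrip(".") != "St":
                expanded.append("Saint")
            else:
                expanded.append("Street")
        else:
            expanded.append(token)
    return " ".join(expanded)
```

Applied to "12 St John St." this yields "12 Saint John Street"; note that the sketch drops the period attached to the abbreviation.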
Text-to-phoneme challenges
Speech synthesis systems use two basic approaches to determine the pronunciation of a word based on its spelling, a process which is often called text-to-phoneme or grapheme-to-phoneme conversion (phoneme is the term used by linguists to describe distinctive sounds in a language).
The simplest approach to text-to-phoneme conversion is the dictionary-based approach, where a large dictionary containing all the words of a language and their correct pronunciations is stored by the program.
Determining the correct pronunciation of each word is a matter of looking up each word in the dictionary and replacing the spelling with the pronunciation specified in the dictionary.
The other approach is rule-based, in which pronunciation rules are applied to words to determine their pronunciations based on their spellings.
This is similar to the "sounding out", or synthetic phonics, approach to learning reading.
Each approach has advantages and drawbacks.
The dictionary-based approach is quick and accurate, but completely fails if it is given a word which is not in its dictionary.
As dictionary size grows, so too do the memory space requirements of the synthesis system.
On the other hand, the rule-based approach works on any input, but the complexity of the rules grows substantially as the system takes into account irregular spellings or pronunciations.
(Consider that the word "of" is very common in English, yet is the only word in which the letter "f" is pronounced [v].)
As a result, nearly all speech synthesis systems use a combination of these approaches.
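A combined approach can be sketched as a dictionary lookup with a rule-based fallback. The tiny lexicon, the one-letter-one-sound rules and the phoneme symbols below are all invented for illustration; real rule sets are context-sensitive and far larger.

```python
# Hypothetical exception dictionary: words whose spelling misleads the rules.
LEXICON = {
    "of": ["AH", "V"],
    "the": ["DH", "AH"],
}

# Hypothetical naive rules: exactly one sound per letter.
LETTER_RULES = dict(zip(
    "abcdefghijklmnopqrstuvwxyz",
    "AE B K D EH F G HH IH JH K L M N AA P K R S T AH V W KS Y Z".split(),
))


def to_phonemes(word):
    """Look the word up first; fall back to letter rules if it is unknown."""
    key = word.lower()
    if key in LEXICON:
        return LEXICON[key]
    return [LETTER_RULES[letter] for letter in key if letter in LETTER_RULES]
```

The word "of" shows why the dictionary is consulted first: the letter rules alone would produce an "F" sound, while the lexicon entry supplies the "V".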
Some languages, like Spanish, have a very regular writing system, and the prediction of the pronunciation of words based on their spellings is quite successful.
Speech synthesis systems for such languages often use the rule-based method extensively, resorting to dictionaries only for those few words, like foreign names and borrowings, whose pronunciations are not obvious from their spellings.
On the other hand, speech synthesis systems for languages like English, which have extremely irregular spelling systems, are more likely to rely on dictionaries, and to use rule-based methods only for unusual words, or words that aren't in their dictionaries.
Evaluation challenges
It is very difficult to evaluate speech synthesis systems consistently because there is no universally agreed objective criterion, and different organizations usually use different speech data.
The quality of a speech synthesis system highly depends on the quality of recording.
Therefore, evaluating speech synthesis systems is almost the same as evaluating the recording skills.
Recently, researchers have started evaluating speech synthesis systems using a common speech dataset.
This may help people to compare the technologies themselves rather than the recordings.
Prosodics and emotional content
A recent study published in the journal "Speech Communication" by Amy Drahota and colleagues at the University of Portsmouth, UK, reported that listeners to voice recordings could determine, at better than chance levels, whether or not the speaker was smiling.
It was suggested that identification of the vocal features which signal emotional content may be used to help make synthesized speech sound more natural.
Dedicated hardware
Votrax
SC-01A (analog formant)
SC-02 / SSI-263 / "Arctic 263"
General Instruments SP0256-AL2 (CTS256A-AL2, MEA8000)
Magnevation SpeakJet (www.speechchips.com TTS256)
Savage Innovations SoundGin
National Semiconductor DT1050 Digitalker (Mozer)
Silicon Systems SSI 263 (analog formant)
Texas Instruments
TMS5110A (LPC)
TMS5200
Oki Semiconductor
MSM5205
MSM5218RS (ADPCM)
Toshiba T6721A
Philips PCF8200
Computer operating systems or outlets with speech synthesis
Apple
The first speech system integrated into an operating system was Apple Computer's MacInTalk in 1984.
Since the 1980s, Macintosh computers have offered text-to-speech capabilities through the MacInTalk software.
In the early 1990s, Apple expanded its capabilities, offering system-wide text-to-speech support.
With the introduction of faster PowerPC-based computers, Apple included higher-quality voice sampling.
Apple also introduced speech recognition into its systems which provided a fluid command set.
More recently, Apple has added sample-based voices.
Starting as a curiosity, the speech system of Apple Macintosh has evolved into a cutting edge fully-supported program, PlainTalk, for people with vision problems.
VoiceOver was included in Mac OS X Tiger and, more recently, Mac OS X Leopard.
The voice shipping with Mac OS X 10.5 ("Leopard") is called "Alex" and features the taking of realistic-sounding breaths between sentences, as well as improved clarity at high read rates.
AmigaOS
The second operating system with advanced speech synthesis capabilities was AmigaOS, introduced in 1985.
The voice synthesis was licensed by Commodore International from a third-party software house (Don't Ask Software, now Softvoice, Inc.) and it featured a complete system of voice emulation, with both male and female voices and "stress" indicator markers, made possible by advanced features of the Amiga hardware audio chipset.
It was divided into a narrator device and a translator library.
Amiga Speak Handler featured a text-to-speech translator.
AmigaOS considered speech synthesis a virtual hardware device, so the user could even redirect console output to it.
Some Amiga programs, such as word processors, made extensive use of the speech system.
Microsoft Windows
Modern Windows systems use SAPI4- and SAPI5-based speech systems that include a speech recognition engine (SRE).
SAPI 4.0 was available on Microsoft-based operating systems as a third-party add-on for systems like Windows 95 and Windows 98.
Windows 2000 added a speech synthesis program called Narrator, directly available to users.
All Windows-compatible programs could make use of speech synthesis features, available through menus once installed on the system.
Microsoft Speech Server is a complete package for voice synthesis and recognition, for commercial applications such as call centers.
Internet
Currently, there are a number of applications, plugins and gadgets that can read messages directly from an e-mail client and web pages from a web browser.
Some specialized software can narrate RSS-feeds.
On the one hand, online RSS-narrators simplify information delivery by allowing users to listen to their favourite news sources and to convert them to podcasts.
On the other hand, online RSS-readers are available on almost any PC connected to the Internet.
Users can download the generated audio files to portable devices, e.g. with the help of a podcast receiver, and listen to them while walking, jogging or commuting to work.
A growing field in internet based TTS technology is web-based assistive technology, e.g. Talklets.
This web-based approach to a traditionally locally installed form of software can give many of those who require accessibility software the ability to access web content from public machines or from machines belonging to others.
While responsiveness is not as immediate as that of applications installed locally, the 'access anywhere' nature of it is the key benefit to this approach.
Others
Some models of Texas Instruments home computers produced in 1979 and 1981 (Texas Instruments TI-99/4 and TI-99/4A) were capable of text-to-phoneme synthesis or reciting complete words and phrases (text-to-dictionary), using a very popular Speech Synthesizer peripheral.
TI used a proprietary codec to embed complete spoken phrases into applications, primarily video games.
Various speech synthesis systems run on free and open source operating systems such as GNU/Linux; they include open-source programs such as the Festival Speech Synthesis System, which uses diphone-based synthesis (and can use a limited number of MBROLA voices), and gnuspeech, from the Free Software Foundation, which uses articulatory synthesis.
Other commercial vendor software also runs on GNU/Linux.
Several commercial companies are also developing speech synthesis systems (this list is reporting them just for the sake of information, not endorsing any specific product): Acapela Group, AT&T, Cepstral, DECtalk, IBM ViaVoice, IVONA TTS, Loquendo TTS, NeoSpeech TTS, Nuance Communications, Rhetorical Systems, SVOX and YAKiToMe!.
Companies which developed speech synthesis systems but which are no longer in this business include BeST Speech (bought by L&H), Lernout & Hauspie (bankrupt) and SpeechWorks (bought by Nuance).
Speech synthesis markup languages
A number of markup languages have been established for the rendition of text as speech in an XML-compliant format.
The most recent is Speech Synthesis Markup Language (SSML), which became a W3C recommendation in 2004.
Older speech synthesis markup languages include Java Speech Markup Language (JSML) and SABLE.
Although each of these was proposed as a standard, none of them has been widely adopted.
Speech synthesis markup languages are distinguished from dialogue markup languages.
VoiceXML, for example, includes tags related to speech recognition, dialogue management and touchtone dialing, in addition to text-to-speech markup.
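As an illustration of what text-to-speech markup looks like, the following minimal sketch builds a small SSML document with Python's standard xml.etree.ElementTree module. The element names (speak, s, break, prosody) and the namespace come from the SSML 1.0 recommendation; the function name build_ssml and its parameters are assumptions made for this example, and how a given synthesizer interprets the markup will vary.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def build_ssml(sentences, rate="medium", pause_ms=300):
    """Return an SSML string reading each sentence at the given rate,
    with a pause of pause_ms milliseconds between sentences.
    (Hypothetical helper, not part of any speech library.)"""
    ET.register_namespace("", SSML_NS)  # serialize SSML as the default namespace
    speak = ET.Element(f"{{{SSML_NS}}}speak", {"version": "1.0"})
    speak.set(f"{{{XML_NS}}}lang", "en-US")
    for i, text in enumerate(sentences):
        if i > 0:
            # <break> inserts a pause between consecutive sentences
            ET.SubElement(speak, f"{{{SSML_NS}}}break", {"time": f"{pause_ms}ms"})
        s = ET.SubElement(speak, f"{{{SSML_NS}}}s")
        prosody = ET.SubElement(s, f"{{{SSML_NS}}}prosody", {"rate": rate})
        prosody.text = text
    return ET.tostring(speak, encoding="unicode")


doc = build_ssml(["Hello.", "This is synthesized speech."], rate="slow", pause_ms=500)
print(doc)
```

The resulting string could be passed to any synthesizer that accepts SSML input; dialogue-level concerns such as recognition grammars are outside the scope of this markup.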
Applications
Accessibility
Speech synthesis has long been a vital assistive technology tool and its application in this area is significant and widespread.
It allows environmental barriers to be removed for people with a wide range of disabilities.
The longest-standing application has been in screen readers for people with visual impairment, but text-to-speech systems are now commonly used by people with dyslexia and other reading difficulties as well as by pre-literate youngsters.
They are also frequently employed to aid those with severe speech impairment, usually through a dedicated voice output communication aid.
News service
Sites such as Ananova have used speech synthesis to convert written news to audio content, which can be used for mobile applications.
Entertainment
Speech synthesis techniques are also used in entertainment productions such as games and anime.
In 2007, Animo Limited announced the development of a software application package based on its speech synthesis software FineSpeech, explicitly geared towards customers in the entertainment industries, able to generate narration and lines of dialogue according to user specifications.
Software such as Vocaloid can generate singing voices via lyrics and melody.
This is also the aim of the Singing Computer project (which uses the GPL software Lilypond and Festival) to help blind people check their lyric input.
Statistical classification
Statistical classification is a procedure in which individual items are placed into groups based on quantitative information on one or more characteristics inherent in the items (referred to as traits, variables, characters, etc) and based on a training set of previously labeled items.
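Formally, the problem can be stated as follows: given training data $\{(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_n, y_n)\}$, produce a classifier $h: \mathcal{X} \to \mathcal{Y}$ which maps an object $\mathbf{x} \in \mathcal{X}$ to its classification label $y \in \mathcal{Y}$.
For example, if the problem is filtering spam, then $\mathbf{x}_i$ is some representation of an email and $y$ is either "Spam" or "Non-Spam".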
Statistical classification algorithms are typically used in pattern recognition systems.
Note: in community ecology, the term "classification" is synonymous with what is commonly known (in machine learning) as clustering.
See that article for more information about purely unsupervised techniques.
The second problem is to consider classification as an estimation problem, where the goal is to estimate a function of the form
$P(\mathrm{class} \mid \vec{x}) = f(\vec{x}; \vec{\theta})$
where the feature vector input is $\vec{x}$, and the function $f$ is typically parameterized by some parameters $\vec{\theta}$.
In the Bayesian approach to this problem, instead of choosing a single parameter vector $\vec{\theta}$, the result is integrated over all possible thetas, with the thetas weighted by how likely they are given the training data $D$:
$P(\mathrm{class} \mid \vec{x}) = \int f(\vec{x}; \vec{\theta}) \, P(\vec{\theta} \mid D) \, d\vec{\theta}$
The third problem is related to the second, but the problem is to estimate the class-conditional probabilities $P(\vec{x} \mid \mathrm{class})$ and then use Bayes' rule to produce the class probability as in the second problem.
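The step from class-conditional probabilities to class probabilities via Bayes' rule can be sketched in a few lines of Python. The function name posterior and all numbers below are hypothetical, chosen only to illustrate the arithmetic for a single email in the spam-filtering example.

```python
def posterior(likelihoods, priors):
    """Bayes' rule: P(class | x) = P(x | class) * P(class) / P(x),
    where P(x) is the sum of the numerators over all classes."""
    joint = {c: likelihoods[c] * priors[c] for c in priors}
    evidence = sum(joint.values())
    return {c: joint[c] / evidence for c in joint}


# Hypothetical values: P(x | class) for one email x, and class priors
likelihoods = {"Spam": 0.30, "Non-Spam": 0.02}
priors = {"Spam": 0.20, "Non-Spam": 0.80}

post = posterior(likelihoods, priors)
label = max(post, key=post.get)  # pick the most probable class
print(post, label)
```

Even though spam has the lower prior here, the much larger class-conditional probability makes it the more probable label for this particular email.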
Examples of classification algorithms include:
Linear classifiers
Fisher's linear discriminant
Logistic regression
Naive Bayes classifier
Perceptron
Support vector machines
Quadratic classifiers
k-nearest neighbor
Boosting
Decision trees
Random forests
Neural networks
Bayesian networks
Hidden Markov models
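As one concrete instance from the list above, a k-nearest neighbor classifier can be sketched as follows. This is a minimal illustration under simple assumptions (Euclidean distance, unweighted majority vote, a tiny hypothetical training set), not a tuned implementation; the function name knn_predict is invented for the example.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Label `query` by majority vote among its k nearest training points.
    `train` is a list of (feature_vector, label) pairs."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical two-dimensional training data with two classes
train = [
    ((0, 0), "A"), ((0, 1), "A"), ((1, 0), "A"),
    ((5, 5), "B"), ((5, 6), "B"), ((6, 5), "B"),
]
print(knn_predict(train, (0.5, 0.5)))
print(knn_predict(train, (5.5, 5.5)))
```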
An intriguing problem in pattern recognition yet to be solved is the relationship between the problem to be solved (data to be classified) and the performance of various pattern recognition algorithms (classifiers).
Van der Walt and Barnard (see reference section) investigated very specific artificial data sets to determine conditions under which certain classifiers perform better and worse than others.
Classifier performance depends greatly on the characteristics of the data to be classified.
There is no single classifier that works best on all given problems (a phenomenon that may be explained by the No-free-lunch theorem).
Various empirical tests have been performed to compare classifier performance and to find the characteristics of data that determine classifier performance.
Determining a suitable classifier for a given problem is however still more an art than a science.
The most widely used classifiers are the Neural Network (Multi-layer Perceptron), Support Vector Machines, k-Nearest Neighbours, Gaussian Mixture Model, Gaussian, Naive Bayes, Decision Tree and RBF classifiers.
Evaluation
The measures Precision and Recall are popular metrics used to evaluate the quality of a classification system.
More recently, Receiver Operating Characteristic (ROC) curves have been used to evaluate the tradeoff between true- and false-positive rates of classification algorithms.
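Precision and recall can be computed directly from counts of true positives, false positives and false negatives. The sketch below assumes a binary problem with "Spam" as the positive class; the labels and predictions are hypothetical, and the convention of returning 0.0 when a denominator is zero is a choice made for this example.

```python
def precision_recall(y_true, y_pred, positive="Spam"):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == positive and t == positive)
    fp = sum(1 for t, p in pairs if p == positive and t != positive)
    fn = sum(1 for t, p in pairs if p != positive and t == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical ground truth and classifier output
y_true = ["Spam", "Spam", "Spam", "Non-Spam", "Non-Spam", "Non-Spam"]
y_pred = ["Spam", "Spam", "Non-Spam", "Spam", "Non-Spam", "Non-Spam"]

precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)
```

Here two of the three predicted spam messages are truly spam, and two of the three true spam messages were found, so both measures equal 2/3.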
Application domains
Computer vision
Medical Imaging and Medical Image Analysis
Optical character recognition
Geostatistics
Speech recognition
Handwriting recognition
Biometric identification
Natural language processing
Document classification
Internet search engines
Credit scoring
Statistical machine translation
Statistical machine translation (SMT) is a machine translation paradigm where translations are generated on the basis of statistical models whose parameters are derived from the analysis of bilingual text corpora.
The statistical approach contrasts with the rule-based approaches to machine translation as well as with example-based machine translation.
The first ideas of statistical machine translation were introduced by Warren Weaver in 1949, including the ideas of applying Claude Shannon's information theory.
Statistical machine translation was re-introduced in 1991 by researchers at IBM's Thomas J. Watson Research Center and has contributed to the significant resurgence in interest in machine translation in recent years.
As of 2006, it is by far the most widely-studied machine translation paradigm.
Benefits
The benefits of statistical machine translation over traditional paradigms that are most often cited are the following:
Better use of resources
There is a great deal of natural language in machine-readable format.
Generally, SMT systems are not tailored to any specific pair of languages.
Rule-based translation systems require the manual development of linguistic rules, which can be costly, and which often do not generalize to other languages.
More natural translations
The ideas behind statistical machine translation come out of information theory.
Essentially, the document is translated according to the probability $p(e \mid f)$ that a string $e$ in the native language (for example, English) is the translation of a string $f$ in the foreign language (for example, French).
Generally, these probabilities are estimated using techniques of parameter estimation.
Bayes' theorem is applied to $p(e \mid f)$, the probability that the foreign string produces the native string, to get $p(e \mid f) \propto p(f \mid e) \, p(e)$, where the translation model $p(f \mid e)$ is the probability that the foreign string is the translation of the native string, and the language model $p(e)$ is the probability of seeing that native string.
Mathematically speaking, finding the best translation $\tilde{e}$ is done by picking the one that gives the highest probability:
$\tilde{e} = \arg\max_{e \in e^*} p(e \mid f) = \arg\max_{e \in e^*} p(f \mid e) \, p(e)$.
For a rigorous implementation of this one would have to perform an exhaustive search by going through all strings in the native language.
Performing the search efficiently is the work of a machine translation decoder, which uses the foreign string, heuristics and other methods to limit the search space while keeping acceptable quality.
This trade-off between quality and time usage can also be found in speech recognition.
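The selection rule described above, choosing the native string e that maximizes p(f|e) p(e), can be sketched over a small, explicitly enumerated candidate set. Real decoders search a vastly larger space with heuristics; the candidate list, the probability tables and the function name best_translation below are all hypothetical.

```python
def best_translation(foreign, candidates, translation_model, language_model):
    """Return the candidate e that maximizes p(f | e) * p(e).
    Missing table entries are treated as probability 0."""
    def score(e):
        return translation_model.get((foreign, e), 0.0) * language_model.get(e, 0.0)
    return max(candidates, key=score)


f = "la maison bleue"
candidates = ["the blue house", "the house blue", "blue the house"]

# Hypothetical translation model p(f | e): a word-for-word rendering scores well
translation_model = {
    (f, "the blue house"): 0.4,
    (f, "the house blue"): 0.5,
    (f, "blue the house"): 0.3,
}
# Hypothetical language model p(e): fluent English is far more probable
language_model = {
    "the blue house": 0.01,
    "the house blue": 0.0005,
    "blue the house": 0.0001,
}

print(best_translation(f, candidates, translation_model, language_model))
```

In this toy setting the translation model alone would prefer the literal word order, but the language model shifts the decision to the fluent candidate.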
As the translation systems are not able to store all native strings and their translations, a document is typically translated sentence by sentence, but even this is not enough.
Language models are typically approximated by smoothed n-gram models, and similar approaches have been applied to translation models, but there is additional complexity due to different sentence lengths and word orders in the languages.
The statistical translation models were initially word based (Models 1-5 from IBM), but significant advances were made with the introduction of phrase based models.
Recent work has incorporated syntax or quasi-syntactic structures.
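A smoothed n-gram language model of the kind mentioned above can be sketched with bigrams and add-one (Laplace) smoothing. Add-one smoothing is used here only because it is the simplest scheme to show; it is an assumption of this example rather than the method of any particular system, and the two-sentence training corpus is hypothetical.

```python
from collections import Counter


def train_bigram(sentences):
    """Count bigrams and their left contexts over whitespace-tokenized sentences."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(tokens)
        contexts.update(tokens[:-1])             # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams, len(vocab)


def sentence_prob(sentence, model):
    """Probability of a sentence under the add-one smoothed bigram model."""
    contexts, bigrams, vocab_size = model
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        # Add-one smoothing keeps unseen bigrams from getting probability 0
        prob *= (bigrams[(prev, word)] + 1) / (contexts[prev] + vocab_size)
    return prob


model = train_bigram(["the house is blue", "the house is small"])
print(sentence_prob("the house is blue", model))
print(sentence_prob("blue is house the", model))
```

The scrambled sentence receives a small but non-zero probability, which is the purpose of smoothing: unseen word sequences are penalized rather than ruled out.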
Word-based translation
In word-based translation, translated elements are words.
Typically, the number of words in translated sentences differs because of compound words, morphology and idioms.
The ratio of the lengths of sequences of translated words is called fertility, which tells how many foreign words each native word produces.
Simple word-based translation cannot handle language pairs with fertility rates different from one.
To cope with, for instance, high fertility rates, a word-based translation system can be allowed to map a single word to multiple words, but not vice versa.
For instance, if we are translating from French to English, each word in English could produce zero or more French words.
However, there is no way to group two English words so that they produce a single French word.
An example of a word-based translation system is the freely available GIZA++ package (GPLed), which includes IBM models.
Phrase-based translation
Phrase-based translation aims to reduce the restrictions of word-based translation by translating sequences of words to sequences of words, where the lengths can differ.
The sequences of words are called, for instance, blocks or phrases, but typically they are not linguistic phrases; they are phrases found in the corpus using statistical methods.
Restricting the phrases to linguistic phrases has been shown to decrease translation quality.
Syntax-based translation
Challenges with statistical machine translation
Problems that statistical machine translation has to deal with include:
Compound words
Idioms
Morphology
Different word orders
Word order differs between languages.
Some classification can be done by naming the typical order of subject (S), verb (V) and object (O) in a sentence and one can talk, for instance, of SVO or VSO languages.
There are also additional differences in word orders, for instance, where modifiers for nouns are located.
In Speech Recognition, the speech signal and the corresponding textual representation can be mapped to each other in blocks in order.
This is not always the case with the same text in two languages.
For SMT, the translation model is only able to translate small sequences of words and word order has to be taken into account somehow.
A typical solution has been re-ordering models, where a distribution of location changes for each item of translation is approximated from aligned bi-text.
Different location changes can be ranked with the help of the language model and the best can be selected.
Syntax
Out of vocabulary (OOV) words
SMT systems store different word forms as separate symbols without any relation to each other, so word forms or phrases that were not in the training data cannot be translated.
The main reasons for out-of-vocabulary words are the limited amount of training data, domain changes and morphology.
Statistics
Statistics is a mathematical science pertaining to the collection, analysis, interpretation or explanation, and presentation of data.
It is applicable to a wide variety of academic disciplines, from the natural and social sciences to the humanities, government and business.
Statistical methods can be used to summarize or describe a collection of data; this is called descriptive statistics.
In addition, patterns in the data may be modeled in a way that accounts for randomness and uncertainty in the observations, and then used to draw inferences about the process or population being studied; this is called inferential statistics.
Both descriptive and inferential statistics comprise applied statistics.
There is also a discipline called mathematical statistics, which is concerned with the theoretical basis of the subject.
The word statistics is also the plural of statistic (singular), which refers to the result of applying a statistical algorithm to a set of data, as in economic statistics, crime statistics, etc.
History
"Five men, Conring, Achenwall, Süssmilch, Graunt and Petty have been honored by different writers as the founder of statistics." claims one source (Willcox, Walter (1938) The Founder of Statistics. Review of the International Statistical Institute 5(4):321-328.)
Some scholars pinpoint the origin of statistics to 1662, with the publication of "Observations on the Bills of Mortality" by John Graunt.
Early applications of statistical thinking revolved around the needs of states to base policy on demographic and economic data.
The scope of the discipline of statistics broadened in the early 19th century to include the collection and analysis of data in general.
Today, statistics is widely employed in government, business, and the natural and social sciences.
Because of its empirical roots and its applications, statistics is generally considered not to be a subfield of pure mathematics, but rather a distinct branch of applied mathematics.
Its mathematical foundations were laid in the 17th century with the development of probability theory by Pascal and Fermat.
Probability theory arose from the study of games of chance.
The method of least squares was first described by Carl Friedrich Gauss around 1794.
The use of modern computers has expedited large-scale statistical computation, and has also made possible new methods that are impractical to perform manually.
Overview
In applying statistics to a scientific, industrial, or societal problem, one begins with a process or population to be studied.
This might be a population of people in a country, of crystal grains in a rock, or of goods manufactured by a particular factory during a given period.
It may instead be a process observed at various times; data collected about this kind of "population" constitute what is called a time series.
For practical reasons, rather than compiling data about an entire population, one usually studies a chosen subset of the population, called a sample.
Data are collected about the sample in an observational or experimental setting.
The data are then subjected to statistical analysis, which serves two related purposes: description and inference.
Descriptive statistics can be used to summarize the data, either numerically or graphically, to describe the sample.
Basic examples of numerical descriptors include the mean and standard deviation.
Graphical summarizations include various kinds of charts and graphs.
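The basic numerical descriptors mentioned above can be computed with Python's standard statistics module. The data values are hypothetical; pstdev treats the data as an entire population, whereas stdev applies the n - 1 denominator appropriate for a sample.

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical measurements

mean = statistics.mean(values)
population_sd = statistics.pstdev(values)  # divides by n
sample_sd = statistics.stdev(values)       # divides by n - 1

print(mean, population_sd, sample_sd)
```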
Inferential statistics is used to model patterns in the data, accounting for randomness and drawing inferences about the larger population.
These inferences may take the form of answers to yes/no questions (hypothesis testing), estimates of numerical characteristics (estimation), descriptions of association (correlation), or modeling of relationships (regression).
Other modeling techniques include ANOVA, time series, and data mining.
The concept of correlation is particularly noteworthy.
Statistical analysis of a data set may reveal that two variables (that is, two properties of the population under consideration) tend to vary together, as if they are connected.
For example, a study of annual income and age of death among people might find that poor people tend to have shorter lives than affluent people.
The two variables are said to be correlated (which is a positive correlation in this case).
However, one cannot immediately infer the existence of a causal relationship between the two variables.
(See Correlation does not imply causation.)
The correlated phenomena could be caused by a third, previously unconsidered phenomenon, called a lurking variable or confounding variable.
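The strength of a linear association between two variables is commonly summarized by the Pearson correlation coefficient. The sketch below computes it from its definition; the income and age figures are invented for illustration, and a high coefficient in such data would say nothing by itself about causation.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariation / (spread_x * spread_y)


# Hypothetical annual income (thousands) and age at death (years)
income = [12, 20, 35, 50, 80]
age_at_death = [66, 70, 74, 79, 83]

r = pearson_r(income, age_at_death)
print(r)
```

A value of r near +1 indicates that the two variables tend to increase together, a value near -1 that one tends to decrease as the other increases, and a value near 0 that there is little linear association.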
If the sample is representative of the population, then inferences and conclusions made from the sample can be extended to the population as a whole.
A major problem lies in determining the extent to which the chosen sample is representative.
Statistics offers methods to estimate and correct for randomness in the sample and in the data collection procedure, as well as methods for designing robust experiments in the first place.
(See experimental design.)
The fundamental mathematical concept employed in understanding such randomness is probability.
Mathematical statistics (also called statistical theory) is the branch of applied mathematics that uses probability theory and analysis to examine the theoretical basis of statistics.
The use of any statistical method is valid only when the system or population under consideration satisfies the basic mathematical assumptions of the method.
Misuse of statistics can produce subtle but serious errors in description and interpretation — subtle in the sense that even experienced professionals sometimes make such errors, serious in the sense that they may affect, for instance, social policy, medical practice and the reliability of structures such as bridges.
Even when statistics is correctly applied, the results can be difficult for the non-expert to interpret.
For example, the statistical significance of a trend in the data, which measures the extent to which the trend could be caused by random variation in the sample, may not agree with one's intuitive sense of its significance.
The set of basic statistical skills (and skepticism) needed by people to deal with information in their everyday lives is referred to as statistical literacy.
Statistical methods
Experimental and observational studies
A common goal for a statistical research project is to investigate causality, and in particular to draw a conclusion on the effect of changes in the values of predictors or independent variables on response or dependent variables.
There are two major types of causal statistical studies, experimental studies and observational studies.
In both types of studies, the effect of differences of an independent variable (or variables) on the behavior of the dependent variable is observed.
The difference between the two types lies in how the study is actually conducted.
Each can be very effective.
An experimental study involves taking measurements of the system under study, manipulating the system, and then taking additional measurements using the same procedure to determine if the manipulation has modified the values of the measurements.
In contrast, an observational study does not involve experimental manipulation.
Instead, data are gathered and correlations between predictors and response are investigated.
An example of an experimental study is the famous Hawthorne studies, which attempted to test the changes to the working environment at the Hawthorne plant of the Western Electric Company.
The researchers were interested in determining whether increased illumination would increase the productivity of the assembly line workers.
The researchers first measured the productivity in the plant, then modified the illumination in an area of the plant and checked if the changes in illumination affected the productivity.
It turned out that the productivity indeed improved (under the experimental conditions).
(See Hawthorne effect.)
However, the study is heavily criticized today for errors in experimental procedures, specifically for the lack of a control group and the lack of blinding.
An example of an observational study is a study which explores the correlation between smoking and lung cancer.
This type of study typically uses a survey to collect observations about the area of interest and then performs statistical analysis.
In this case, the researchers would collect observations of both smokers and non-smokers, perhaps through a case-control study, and then look for the number of cases of lung cancer in each group.
The basic steps of an experiment are:
Planning the research, including determining information sources, research subject selection, and ethical considerations for the proposed research and method.
Design of experiments, concentrating on the system model and the interaction of independent and dependent variables.
Summarizing a collection of observations to feature their commonality by suppressing details (descriptive statistics).
Reaching consensus about what the observations tell about the world being observed (statistical inference).
Documenting / presenting the results of the study.
Levels of measurement
See: Stanley Stevens' "Scales of measurement" (1946): nominal, ordinal, interval, ratio
There are four types of measurements or levels of measurement or measurement scales used in statistics: nominal, ordinal, interval, and ratio.
They have different degrees of usefulness in statistical research.
Ratio measurements have both a zero value defined and the distances between different measurements defined; they provide the greatest flexibility in statistical methods that can be used for analyzing the data.
Interval measurements have meaningful distances between measurements defined, but have no meaningful zero value defined (as is the case with IQ measurements or with temperature measurements in Fahrenheit).
Ordinal measurements have imprecise differences between consecutive values, but have a meaningful order to those values.
Nominal measurements have no meaningful rank order among values.
Since variables conforming only to nominal or ordinal measurements cannot be reasonably measured numerically, they are sometimes grouped together as categorical variables, whereas ratio and interval measurements are grouped together as quantitative or continuous variables due to their numerical nature.
Statistical techniques
Some well known statistical tests and procedures for research observations are:
Student's t-test
chi-square test
Analysis of variance (ANOVA)
Mann-Whitney U
Regression analysis
Factor Analysis
Correlation
Pearson product-moment correlation coefficient
Spearman's rank correlation coefficient
Time Series Analysis
Specialized disciplines
Some fields of inquiry use applied statistics so extensively that they have specialized terminology.
These disciplines include:
Actuarial science
Applied information economics
Biostatistics
Bootstrap & Jackknife Resampling
Business statistics
Data analysis
Data mining (applying statistics and pattern recognition to discover knowledge from data)
Demography
Economic statistics (Econometrics)
Energy statistics
Engineering statistics
Environmental Statistics
Epidemiology
Geography and Geographic Information Systems, more specifically in Spatial analysis
Image processing
Multivariate Analysis
Psychological statistics
Quality
Social statistics
Statistical literacy
Statistical modeling
Statistical surveys
Process analysis and chemometrics (for analysis of data from analytical chemistry and chemical engineering)
Structured data analysis (statistics)
Survival analysis
Reliability engineering
Statistics in various sports, particularly baseball and cricket
Statistics is also a key tool in business and manufacturing.
It is used to understand measurement system variability, to control processes (as in statistical process control or SPC), to summarize data, and to make data-driven decisions.
In these roles, it is a key tool, and perhaps the only reliable tool.
Statistical computing
The rapid and sustained increases in computing power starting from the second half of the 20th century have had a substantial impact on the practice of statistical science.
Early statistical models were almost always from the class of linear models, but powerful computers, coupled with suitable numerical algorithms, caused an increased interest in nonlinear models (especially neural networks and decision trees) as well as the creation of new types, such as generalised linear models and multilevel models.
Increased computing power has also led to the growing popularity of computationally-intensive methods based on resampling, such as permutation tests and the bootstrap, while techniques such as Gibbs sampling have made Bayesian methods more feasible.
The computer revolution has implications for the future of statistics with new emphasis on "experimental" and "empirical" statistics.
A large number of both general-purpose and special-purpose statistical software packages are now available.
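The bootstrap mentioned above illustrates why resampling methods depend on cheap computation: the same statistic is recomputed on thousands of resamples drawn with replacement from the data. The following sketch computes a simple percentile interval for the median; the function name bootstrap_ci, the number of resamples and the data are assumptions made for the example, and more refined bootstrap intervals exist.

```python
import random
import statistics


def bootstrap_ci(data, stat=statistics.median, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for stat(data).
    Resamples the data with replacement and takes empirical quantiles."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    estimates = sorted(
        stat(rng.choices(data, k=len(data))) for _ in range(n_resamples)
    )
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


data = [3.1, 4.7, 5.0, 5.2, 5.9, 6.3, 7.8, 8.4, 12.6]  # hypothetical sample
low, high = bootstrap_ci(data)
print(low, high)
```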
Misuse
There is a general perception that statistical knowledge is all-too-frequently intentionally misused by finding ways to interpret only the data that are favorable to the presenter.
A famous saying attributed to Benjamin Disraeli is, "There are three kinds of lies: lies, damned lies, and statistics"; and Harvard President Lawrence Lowell wrote in 1909 that statistics, "like veal pies, are good if you know the person that made them, and are sure of the ingredients".
If various studies appear to contradict one another, then the public may come to distrust such studies.
For example, one study may suggest that a given diet or activity raises blood pressure, while another may suggest that it lowers blood pressure.
The discrepancy can arise from subtle variations in experimental design, such as differences in the patient groups or research protocols, that are not easily understood by the non-expert.
(Media reports sometimes omit this vital contextual information entirely.)
By choosing (or rejecting, or modifying) a certain sample, results can be manipulated.
Such manipulations need not be malicious or devious; they can arise from unintentional biases of the researcher.
The graphs used to summarize data can also be misleading.
Deeper criticisms come from the fact that the hypothesis testing approach, widely used and in many cases required by law or regulation, forces one hypothesis (the null hypothesis) to be "favored", and can also seem to exaggerate the importance of minor differences in large studies.
A difference that is highly statistically significant can still be of no practical significance.
(See criticism of hypothesis testing and controversy over the null hypothesis.)
One response is to give greater emphasis to the p-value rather than simply reporting whether a hypothesis is rejected at the given level of significance.
The p-value, however, does not indicate the size of the effect.
Another increasingly common approach is to report confidence intervals.
Although these are produced from the same calculations as those of hypothesis tests or p-values, they describe both the size of the effect and the uncertainty surrounding it.
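A confidence interval for the difference between two group means shows both the estimated effect and its uncertainty. The sketch below uses a normal approximation with z = 1.96 for an approximate 95% interval; for samples as small as the hypothetical ones shown, a t-distribution quantile would be more appropriate, so the numbers are illustrative only.

```python
import math
import statistics


def mean_diff_ci(a, b, z=1.96):
    """Difference of means with an approximate confidence interval
    (normal approximation, unequal variances allowed)."""
    diff = statistics.mean(a) - statistics.mean(b)
    std_error = math.sqrt(
        statistics.variance(a) / len(a) + statistics.variance(b) / len(b)
    )
    return diff, (diff - z * std_error, diff + z * std_error)


# Hypothetical measurements from a treatment group and a control group
treatment = [5.1, 4.9, 5.0, 5.2, 4.8]
control = [4.0, 4.2, 3.9, 4.1, 3.8]

diff, (low, high) = mean_diff_ci(treatment, control)
print(diff, low, high)
```

An interval that excludes zero corresponds to rejecting the null hypothesis of no difference at roughly the matching significance level, while the width of the interval conveys how precisely the effect has been estimated.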
Syntax
In linguistics, syntax (from Ancient Greek syn-, "together", and táxis, "arrangement") is the study of the principles and rules for constructing sentences in natural languages.
In addition to referring to the discipline, the term syntax is also used to refer directly to the rules and principles that govern the sentence structure of any individual language, as in "the syntax of Modern Irish".
Modern research in syntax attempts to describe languages in terms of such rules.
Many professionals in this discipline attempt to find general rules that apply to all natural languages.
The term syntax is also sometimes used to refer to the rules governing the behavior of mathematical systems, such as logic, artificial formal languages, and computer programming languages.
Early history
Works on grammar were being written long before modern syntax came about; the Aṣṭādhyāyī of Pāṇini is often cited as an example of a pre-modern work that approaches the sophistication of a modern syntactic theory.
In the West, the school of thought that came to be known as "traditional grammar" began with the work of Dionysius Thrax.
For centuries, work in syntax was dominated by a framework known as grammaire générale, first expounded in 1660 by Antoine Arnauld in a book of the same title.
This system took as its basic premise the assumption that language is a direct reflection of thought processes and therefore there is a single, most natural way to express a thought.
That way, coincidentally, was exactly the way it was expressed in French.
However, in the 19th century, with the development of historical-comparative linguistics, linguists began to realize the sheer diversity of human language, and to question fundamental assumptions about the relationship between language and logic.
It became apparent that there was no such thing as a most natural way to express a thought, and therefore logic could no longer be relied upon as a basis for studying the structure of language.
The Port-Royal grammar modeled the study of syntax upon that of logic (indeed, large parts of the Port-Royal Logic were copied or adapted from the Grammaire générale).
Syntactic categories were identified with logical ones, and all sentences were analyzed in terms of "Subject – Copula – Predicate".
Initially, this view was adopted even by the early comparative linguists such as Franz Bopp.
The central role of syntax within theoretical linguistics became clear only in the 20th century, which could reasonably be called the "century of syntactic theory" as far as linguistics is concerned.
For a detailed and critical survey of the history of syntax in the last two centuries, see the monumental work by Graffi (2001).
Modern theories
There are a number of theoretical approaches to the discipline of syntax.
Many linguists (e.g. Noam Chomsky) see syntax as a branch of biology, since they conceive of syntax as the study of linguistic knowledge as embodied in the human mind.
Others (e.g. Gerald Gazdar) take a more Platonistic view, since they regard syntax as the study of an abstract formal system.
Yet others (e.g. Joseph Greenberg) consider grammar a taxonomical device to reach broad generalizations across languages.
Some of the major approaches to the discipline are listed below.
Generative grammar
The hypothesis of generative grammar is that language is a structure of the human mind.
The goal of generative grammar is to make a complete model of this inner language (known as i-language).
This model could be used to describe all human language and to predict the grammaticality of any given utterance (that is, to predict whether the utterance would sound correct to native speakers of the language).
This approach to language was pioneered by Noam Chomsky.
Most generative theories (although not all of them) assume that syntax is based upon the constituent structure of sentences.
Generative grammars are among the theories that focus primarily on the form of a sentence, rather than its communicative function.
Among the many generative theories of linguistics are:
Transformational Grammar (TG) (now largely out of date)
Government and binding theory (GB) (common in the late 1970s and 1980s)
Minimalism (MP) (the most recent Chomskyan version of generative grammar)
Other theories that find their origin in the generative paradigm are:
Generative semantics (now largely out of date)
Relational grammar (RG) (now largely out of date)
Arc Pair grammar
Generalized phrase structure grammar (GPSG; now largely out of date)
Head-driven phrase structure grammar (HPSG)
Lexical-functional grammar (LFG)
Categorial grammar
Categorial grammar is an approach that attributes the syntactic structure not to rules of grammar, but to the properties of the syntactic categories themselves.
For example, rather than asserting that sentences are constructed by a rule that combines a noun phrase (NP) and a verb phrase (VP) (e.g. the phrase structure rule S → NP VP), in categorial grammar, such principles are embedded in the category of the head word itself.
So the syntactic category for an intransitive verb is a complex formula representing the fact that the verb acts as a functor which requires an NP as an input and produces a sentence level structure as an output.
This complex category is notated as (NP\S) instead of V.
NP\S is read as "a category that searches to the left (indicated by \) for an NP (the element on the left) and outputs a sentence (the element on the right)".
The category of transitive verb is defined as an element that requires two NPs (its subject and its direct object) to form a sentence.
This is notated as (NP/(NP\S)), which means "a category that searches to the right (indicated by /) for an NP (the object), and generates a function (equivalent to the VP) which is (NP\S), which in turn represents a function that searches to the left for an NP and produces a sentence".
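The way such categories combine can be sketched in a few lines of Python. This is a minimal illustration, not a full categorial grammar: it follows the notation used above, in which the argument is written before the slash and the result after it, and the lexicon and the function `combine` are hypothetical.

```python
# Atomic categories are strings; complex ones are (slash, argument, result)
# tuples, following the text's notation where the argument is written first.
NP, S = "NP", "S"
INTRANSITIVE = ("\\", NP, S)            # NP\S
TRANSITIVE = ("/", NP, INTRANSITIVE)    # NP/(NP\S)

def combine(left, right):
    """Return the category produced by two adjacent categories, or None."""
    # The right-hand item searches to its left for its argument.
    if isinstance(right, tuple) and right[0] == "\\" and right[1] == left:
        return right[2]
    # The left-hand item searches to its right for its argument.
    if isinstance(left, tuple) and left[0] == "/" and left[1] == right:
        return left[2]
    return None

# Hypothetical toy lexicon.
LEXICON = {"Mary": NP, "John": NP, "sleeps": INTRANSITIVE, "sees": TRANSITIVE}

vp = combine(LEXICON["sees"], LEXICON["John"])   # verb + object gives NP\S
sentence = combine(LEXICON["Mary"], vp)          # subject + NP\S gives S
print(vp == INTRANSITIVE, sentence)
```

No rule such as S → NP VP appears anywhere in the sketch; the sentence is assembled purely from the categories stored with each word.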
Tree-adjoining grammar is a categorial grammar that adds in partial tree structures to the categories.
Dependency grammar
Dependency grammar is a different type of approach in which structure is determined by the relations (such as grammatical relations) between a word (a head) and its dependents, rather than being based on constituent structure.
For example, syntactic structure is described in terms of whether a particular noun is the subject or agent of the verb, rather than describing the relations in terms of trees (one version of which is the parse tree) or other structural system.
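A dependency analysis can be represented as a set of head-dependent relations rather than as a tree of constituents. The following sketch assumes an illustrative sentence and illustrative relation labels; the class `Dependency` and the helper functions are hypothetical and not taken from any particular theory.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Dependency:
    head: str       # the governing word
    dependent: str  # the word that depends on the head
    relation: str   # grammatical relation label (labels here are illustrative)

# Hypothetical analysis of "The dog chased a cat".
deps = [
    Dependency("chased", "dog", "subject"),
    Dependency("chased", "cat", "object"),
    Dependency("dog", "The", "determiner"),
    Dependency("cat", "a", "determiner"),
]

def dependents_of(head, dependencies):
    """Map each relation label to the dependent that bears it for `head`."""
    return {d.relation: d.dependent for d in dependencies if d.head == head}

def find_root(dependencies):
    """The root is the one head that is not itself a dependent of anything."""
    heads = {d.head for d in dependencies}
    dependents = {d.dependent for d in dependencies}
    (root,) = heads - dependents
    return root

print(find_root(deps))
print(dependents_of("chased", deps))
```

The structure records directly that "dog" is the subject of "chased", without any intermediate phrase nodes.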
Some dependency-based theories of syntax:
Algebraic syntax
Word grammar
Operator Grammar
Stochastic/probabilistic grammars/network theories
Theoretical approaches to syntax that are based upon probability theory are known as stochastic grammars.
One common implementation of such an approach makes use of a neural network or connectionism.
Some theories based within this approach are:
Optimality theory
Stochastic context-free grammar
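As a rough illustration of a stochastic context-free grammar, each rewrite rule can carry a probability, and the probability of a derivation is then the product of the probabilities of the rules it uses. The grammar, the probabilities, and the function `derivation_probability` below are hypothetical.

```python
# Hypothetical toy grammar: each left-hand side maps to its possible
# expansions, and the probabilities for one left-hand side sum to 1.
RULES = {
    "S": {("NP", "VP"): 1.0},
    "NP": {("she",): 0.6, ("fish",): 0.4},
    "VP": {("V", "NP"): 0.7, ("V",): 0.3},
    "V": {("eats",): 1.0},
}

def derivation_probability(tree):
    """Probability of a parse tree given as (label, child, child, ...).

    Leaves are plain strings; the result is the product of the
    probabilities of every rule used in the tree.
    """
    if isinstance(tree, str):
        return 1.0
    label, *children = tree
    # The right-hand side of the rule is the sequence of child labels.
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    prob = RULES[label][rhs]
    for child in children:
        prob *= derivation_probability(child)
    return prob

tree = ("S", ("NP", "she"), ("VP", ("V", "eats"), ("NP", "fish")))
print(derivation_probability(tree))
```

Here the derivation uses rules with probabilities 1.0, 0.6, 0.7, 1.0 and 0.4, so its probability is their product. When a sentence has several parses, such scores allow them to be ranked.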
Functionalist grammars
Functionalist theories, although focused upon form, are driven by explanation based upon the function of a sentence (i.e. its communicative function).
Some typical functionalist theories include:
Functional grammar (Dik)
Prague Linguistic Circle
Systemic functional grammar
Cognitive grammar
Construction grammar (CxG)
Role and reference grammar (RRG)
SYSTRAN
SYSTRAN, founded by Dr. Peter Toma in 1968, is one of the oldest machine translation companies.
SYSTRAN has done extensive work for the United States Department of Defense and the European Commission.
SYSTRAN provides the technology for Yahoo! and AltaVista's Babel Fish, among others; its use for all of the language combinations offered by Google's language tools ended around 2007.
Commercial versions of SYSTRAN run on Microsoft Windows (including Windows Mobile), Linux and Solaris.
History
With its origin in the Georgetown machine translation effort, SYSTRAN was one of the few machine translation systems to survive the major decrease in funding after the ALPAC Report of the mid-1960s.
The company was established in La Jolla, California, to translate Russian text into English for the United States Air Force during the Cold War.
Large numbers of Russian scientific and technical documents were translated using SYSTRAN under the auspices of the USAF Foreign Technology Division (later the National Air and Space Intelligence Center) at Wright-Patterson Air Force Base, Ohio.
The quality of the translations, although only approximate, was usually adequate for understanding content.
The company was sold in 1986 to the Gachot family, based in Paris, France, and is now publicly traded on the French stock exchange.
It has its main office at the Grande Arche in La Défense and maintains a secondary office in La Jolla, San Diego, California.
Languages
Here is a list of the source and target languages SYSTRAN works with.
Many of the pairs are to or from English or French.
Russian into English (1968)
English into Russian (1973) for the Apollo-Soyuz project
English source (1975) for the European Commission
Arabic
Chinese
Danish
Dutch
French
German
Greek
Hindi
Italian
Japanese
Korean
Norwegian
Serbo-Croatian
Spanish
Swedish
Persian
Polish
Portuguese
Ukrainian
Urdu
Text analytics
The term text analytics describes a set of linguistic, lexical, pattern recognition, extraction, tagging/structuring, visualization, and predictive techniques.
The term also describes processes that apply these techniques, whether independently or in conjunction with query and analysis of fielded, numerical data, to solve business problems.
These techniques and processes discover and present knowledge – facts, business rules, and relationships – that is otherwise locked in textual form, impenetrable to automated processing.
A typical application is to scan a set of documents written in a natural language and either model the document set for predictive classification purposes or populate a database or search index with the information extracted.
Current approaches to text analytics use natural language processing techniques that focus on specialized domains.
Typical subtasks are:
Named Entity Recognition: recognition of entity names (for people and organizations), place names, temporal expressions, and certain types of numerical expressions.
Coreference: identification of chains of noun phrases that refer to the same object.
For example, anaphora is a type of coreference.
Relationship Extraction: extraction of named relationships between entities in text.
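A very reduced sketch of the extraction step is shown below. It recognizes only a few temporal and numerical expressions by pattern matching; the patterns, labels, and sample sentence are hypothetical, and practical systems rely on much richer linguistic models than regular expressions.

```python
import re

# Toy patterns; real systems rely on far richer linguistic models.
PATTERNS = {
    "DATE": re.compile(r"\b\d{1,2} (?:January|February|March|April|May|June|July|"
                       r"August|September|October|November|December) \d{4}\b"),
    "MONEY": re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?"),
    "PERCENT": re.compile(r"\b\d+(?:\.\d+)?%"),
}

def extract_entities(text):
    """Return (label, matched text, start offset) triples, sorted by position."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((label, match.group(), match.start()))
    return sorted(found, key=lambda item: item[2])

sample = "On 3 March 2006 the firm reported revenue of $1,250,000, up 12.5% on the year."
for entity in extract_entities(sample):
    print(entity)
```

The resulting triples are the kind of structured records that could be loaded into a database or search index.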